ARQ - Extending Query Execution
By implementing the Graph interface, Jena can be extended to use new storage back-ends or to access non-RDF data. Overriding the graph's find operation is enough to provide access over a read-only database.
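As a sketch of that idea (Jena 2.x names as used in the rest of this post; MyReadOnlyGraph and its backing store are hypothetical, and in Jena 3.x the packages are org.apache.jena.* and the hook takes a Triple instead of a TripleMatch):

```java
import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.graph.Triple;
import com.hp.hpl.jena.graph.TripleMatch;
import com.hp.hpl.jena.graph.impl.GraphBase;
import com.hp.hpl.jena.util.iterator.ExtendedIterator;
import com.hp.hpl.jena.util.iterator.WrappedIterator;

import java.util.ArrayList;
import java.util.List;

// A minimal read-only Graph over a non-RDF source: only find() is implemented.
public class MyReadOnlyGraph extends GraphBase {
    @Override
    protected ExtendedIterator<Triple> graphBaseFind(TripleMatch m) {
        // Translate the (S, P, O) pattern into a lookup against the backing store.
        Node s = m.getMatchSubject();    // null means "match any subject"
        Node p = m.getMatchPredicate();
        Node o = m.getMatchObject();
        List<Triple> results = new ArrayList<>();
        // ... query the underlying read-only data source and add matching triples ...
        return WrappedIterator.create(results.iterator());
    }
}
```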
ARQ query processing has six steps in total: parsing, algebra generation, execution building, high-level optimization, low-level optimization and finally evaluation.
Parsing
Parsing turns the query string into a Query object. The Query class represents a query as an abstract syntax tree (AST) and provides methods to build up that AST.
Algebra generation
The Query object is then compiled into a SPARQL algebra expression.
High-Level Optimization and Transformations
A series of transformations is applied to the algebra expression. Users can plug in transformations of their own.
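For instance, a custom transformation can be written against the ARQ algebra API by subclassing TransformCopy and applying it with Transformer.transform (a sketch; MyTransform and its no-op body are only illustrative):

```java
import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.sparql.algebra.Algebra;
import com.hp.hpl.jena.sparql.algebra.Op;
import com.hp.hpl.jena.sparql.algebra.TransformCopy;
import com.hp.hpl.jena.sparql.algebra.Transformer;
import com.hp.hpl.jena.sparql.algebra.op.OpBGP;

public class TransformExample {
    // Leaves every operator unchanged except OpBGP, where a rewrite could happen.
    static class MyTransform extends TransformCopy {
        @Override
        public Op transform(OpBGP opBGP) {
            // rewrite the basic graph pattern here; returning it unchanged is a no-op
            return opBGP;
        }
    }

    public static void main(String[] args) {
        Query query = QueryFactory.create("SELECT * WHERE { ?s ?p ?o }");
        Op op = Algebra.compile(query);                      // algebra generation
        op = Algebra.optimize(op);                           // standard high-level transformations
        op = Transformer.transform(new MyTransform(), op);   // apply the custom transformation
        System.out.println(op);
    }
}
```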
Transformations process the algebra expression bottom-up.
Low-Level Optimization and Evaluation
One of the responsibilities of low-level optimization is choosing the order in which the query is executed.
Query Engines and Query Engine Factories
Everything from building the algebra to executing the query is driven through QueryExecution.execSelect or one of QueryExecution's other exec operations. While a QueryExecution runs, it creates a query engine.

The query engine follows a top-down execution model. When the query execution factory is handed a dataset and a query, it asks the registered engine factories, through their accept method, which of them can execute the query. Once a query engine is chosen, its create method builds a Plan object, and from the Plan the query's QueryIterator is obtained.
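A sketch of such a factory, with made-up class names, that simply reuses the standard main engine:

```java
import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.sparql.algebra.Op;
import com.hp.hpl.jena.sparql.core.DatasetGraph;
import com.hp.hpl.jena.sparql.engine.Plan;
import com.hp.hpl.jena.sparql.engine.QueryEngineFactory;
import com.hp.hpl.jena.sparql.engine.QueryEngineRegistry;
import com.hp.hpl.jena.sparql.engine.binding.Binding;
import com.hp.hpl.jena.sparql.engine.main.QueryEngineMain;
import com.hp.hpl.jena.sparql.util.Context;

// A query engine that just reuses QueryEngineMain (hypothetical example class).
class MyQueryEngine extends QueryEngineMain {
    MyQueryEngine(Query query, DatasetGraph dataset, Binding initial, Context context) {
        super(query, dataset, initial, context);
    }
}

class MyQueryEngineFactory implements QueryEngineFactory {
    @Override
    public boolean accept(Query query, DatasetGraph dataset, Context context) {
        // Decide whether this engine should handle the (query, dataset) pair,
        // e.g. by checking the concrete DatasetGraph class.
        return true;
    }

    @Override
    public Plan create(Query query, DatasetGraph dataset, Binding initial, Context context) {
        return new MyQueryEngine(query, dataset, initial, context).getPlan();
    }

    // The Op forms are used when execution starts from an algebra expression;
    // this sketch does not handle that case.
    @Override
    public boolean accept(Op op, DatasetGraph dataset, Context context) { return false; }

    @Override
    public Plan create(Op op, DatasetGraph dataset, Binding initial, Context context) { return null; }

    static void register() {
        QueryEngineRegistry.addFactory(new MyQueryEngineFactory());
    }
}
```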
Main Query Engine
The main query engine can execute any query. Once initialization is complete, it calls QC.execute to run the query. Any extension that wants to reuse the main query engine has to supply its own OpExecutor.

Sub-queries are executed through the same QC.execute. QC.execute creates an OpExecutor object and uses it to execute an algebra operation.
There are two ways to extend the main query engine:
- Stage generators, which execute basic graph patterns while reusing the rest of the engine
- OpExecutor, which executes specific operators.
Stage generator
The advantage of a StageGenerator is that, compared with an OpExecutor, much less detail about the engine needs to be understood; it is enough to put it into the context with set(ARQ.stageGenerator, stageGenAlt);.
The default handling of a basic graph pattern is to call OpExecutor.execute(OpBGP, QueryIterator), which in turn calls StageGenerator.execute(BasicPattern pattern, QueryIterator input, ExecutionContext execCxt). A custom StageGenerator can be registered like this:
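This is the registration idiom shown in the ARQ documentation (MyStageGenerator stands for your own class, sketched further below):

```java
// Wrap the currently registered StageGenerator and install the new one globally.
StageGenerator orig = (StageGenerator) ARQ.getContext().get(ARQ.stageGenerator);
StageGenerator stageGenAlt = new MyStageGenerator(orig);   // your own implementation
StageBuilder.setGenerator(ARQ.getContext(), stageGenAlt);
```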
However, looking at the 2.12 code, OpExecutor.execute for a BGP essentially does the following:
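(Paraphrased rather than quoted verbatim from the 2.12 source; the point is only where the StageGenerator comes from.)

```java
protected QueryIterator execute(OpBGP opBGP, QueryIterator input) {
    BasicPattern pattern = opBGP.getPattern();
    // StageBuilder looks up the StageGenerator registered under ARQ.stageGenerator
    // in the context and falls back to the default generator if none is set.
    return StageBuilder.execute(pattern, input, execCxt);
}
```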
So it is enough to set the stageGenerator in the context:
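For example (stageGenAlt being your StageGenerator instance):

```java
// Globally, via ARQ's global context ...
ARQ.getContext().set(ARQ.stageGenerator, stageGenAlt);
// ... or per execution, on the QueryExecution's own context:
// qexec.getContext().set(ARQ.stageGenerator, stageGenAlt);
```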
For a custom StageGenerator, the recommended approach is to build it as a chain: if the active graph is not one of your own graph classes, pass the request on to the StageGenerator above it in the chain.
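A sketch of such a chained generator, reusing the MyReadOnlyGraph class from the sketch at the top of this post (the actual pattern matching is left out):

```java
import com.hp.hpl.jena.graph.Graph;
import com.hp.hpl.jena.sparql.core.BasicPattern;
import com.hp.hpl.jena.sparql.engine.ExecutionContext;
import com.hp.hpl.jena.sparql.engine.QueryIterator;
import com.hp.hpl.jena.sparql.engine.main.StageGenerator;

public class MyStageGenerator implements StageGenerator {
    private final StageGenerator above;   // next generator in the chain

    public MyStageGenerator(StageGenerator original) {
        this.above = original;
    }

    @Override
    public QueryIterator execute(BasicPattern pattern, QueryIterator input, ExecutionContext execCxt) {
        Graph g = execCxt.getActiveGraph();
        // Not our graph class: bounce the request up the chain.
        if (!(g instanceof MyReadOnlyGraph))
            return above.execute(pattern, input, execCxt);
        MyReadOnlyGraph graph = (MyReadOnlyGraph) g;
        // ... create and return a QueryIterator that matches the pattern against graph ...
        throw new UnsupportedOperationException("pattern matching not shown in this sketch");
    }
}
```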
OpExecutor
If the operators you need to extend go beyond basic graph patterns, you have to extend OpExecutor itself. The usual steps are:
- Subclass an existing OpExecutor and override the QueryIterator execute(...) methods for the operators you care about.
- Register an OpExecutorFactory with QC or in the ExecutionContext (see the sketch below).
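A sketch of both steps, assuming we want to intercept FILTER evaluation (the class name and method body are only illustrative):

```java
import com.hp.hpl.jena.query.ARQ;
import com.hp.hpl.jena.sparql.algebra.op.OpFilter;
import com.hp.hpl.jena.sparql.engine.ExecutionContext;
import com.hp.hpl.jena.sparql.engine.QueryIterator;
import com.hp.hpl.jena.sparql.engine.main.OpExecutor;
import com.hp.hpl.jena.sparql.engine.main.OpExecutorFactory;
import com.hp.hpl.jena.sparql.engine.main.QC;

public class MyOpExecutor extends OpExecutor {
    protected MyOpExecutor(ExecutionContext execCxt) {
        super(execCxt);
    }

    @Override
    protected QueryIterator execute(OpFilter opFilter, QueryIterator input) {
        // Custom handling of FILTER would go here; this sketch just falls back.
        return super.execute(opFilter, input);
    }

    // The factory QC uses to create an OpExecutor for each execution.
    private static final OpExecutorFactory factory = new OpExecutorFactory() {
        @Override
        public OpExecutor create(ExecutionContext execCxt) {
            return new MyOpExecutor(execCxt);
        }
    };

    public static void register() {
        // Globally, via ARQ's context ...
        QC.setFactory(ARQ.getContext(), factory);
        // ... or per dataset: QC.setFactory(datasetGraph.getContext(), factory);
    }
}
```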
The difference between Dataset, Model and Graph
Jena is divided into an API, for application developers, and an SPI for systems developers, such as people making storage engines, reasoners etc.
Dataset, Model, Statement, Resource and Literal are API interfaces and provide many conveniences for application developers.
DatasetGraph, Graph, Triple, Node are SPI interfaces. They're pretty spartan and simple to implement (as you'd hope if you've got to implement the things).
- A DataSource is a collection of models (one being the Default Model, any others being Named Models) that you expect will have new triples added to it over time. You can read and write on DataSources.
- A Dataset is like a DataSource, but its triples are static - you don’t expect new ones to be added or existing ones to be deleted. These guys are read-only.
- A Model is a collection of statements- this is what you typically aim your SPARQL queries at. If you SPARQL a DataSource or Dataset and don’t use a ‘FROM NAMED’ clause, you’re querying the Default Model.
- A Graph is a collection of triples. Every Model can be turned into a Graph, to provide a somewhat closer representation of the RDF, OWL, and SPARQL standards.
- A DatasetGraph is a container for Graphs, similar to a DataSource (i.e. read/write), that provides the infrastructure for Default and Named Graphs.
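A few lines showing how the two levels connect (a sketch using standard factory methods):

```java
import com.hp.hpl.jena.graph.Graph;
import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.query.DatasetFactory;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.sparql.core.DatasetGraph;

public class ApiSpiBridge {
    public static void main(String[] args) {
        // API -> SPI: every Model wraps a Graph, every Dataset wraps a DatasetGraph.
        Model model = ModelFactory.createDefaultModel();
        Graph graph = model.getGraph();

        Dataset dataset = DatasetFactory.createMem();
        DatasetGraph dsg = dataset.asDatasetGraph();

        // SPI -> API: a Graph (e.g. a custom one) can be presented as a Model again.
        Model wrapped = ModelFactory.createModelForGraph(graph);
        System.out.println(wrapped.size() + " statements, default graph = " + dsg.getDefaultGraph());
    }
}
```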
This article is published under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license. Reposting is welcome, but please credit http://thousandhu.github.io and keep the reposted article intact. All other rights are reserved.
Permalink: http://thousandhu.github.io/2016/04/20/apache-jena-arq-模块扩展/