Extending the Apache Jena ARQ Module

ARQ - Extending Query Execution

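Jena can be extended to new storage layers, or to non-RDF data sources, by implementing the Graph interface. For read-only access to an existing database, overriding the graph's find method is enough.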
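To make the find-based approach concrete, here is a minimal toy sketch that does not use Jena's classes. ReadOnlyGraphSketch, its nested Triple type, and the sample table are all hypothetical names invented for illustration. It shows how a single find(s, p, o) lookup, where null acts as a wildcard, can expose non-RDF data (a table of rows and columns) as triples. In Jena itself the usual route is, as far as I know, to subclass GraphBase and implement its find hook, so treat this only as a model of the idea.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model (not Jena's API): a read-only "graph" view over a plain table. */
final class ReadOnlyGraphSketch {

    /** A triple of plain strings; in a pattern, null means "match anything". */
    static final class Triple {
        final String s, p, o;
        Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
        @Override public String toString() { return "(" + s + " " + p + " " + o + ")"; }
    }

    /** Non-RDF backing data: row id -> (column name -> value). */
    private final Map<String, Map<String, String>> rows;

    ReadOnlyGraphSketch(Map<String, Map<String, String>> rows) { this.rows = rows; }

    /** The single lookup primitive: every table cell is exposed as one triple. */
    List<Triple> find(String s, String p, String o) {
        List<Triple> out = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> row : rows.entrySet()) {
            if (s != null && !s.equals(row.getKey())) continue;
            for (Map.Entry<String, String> cell : row.getValue().entrySet()) {
                if (p != null && !p.equals(cell.getKey())) continue;
                if (o != null && !o.equals(cell.getValue())) continue;
                out.add(new Triple(row.getKey(), cell.getKey(), cell.getValue()));
            }
        }
        return out;
    }

    /** Builds a hypothetical two-row sample table. */
    static ReadOnlyGraphSketch sample() {
        Map<String, Map<String, String>> rows = new LinkedHashMap<>();
        Map<String, String> alice = new LinkedHashMap<>();
        alice.put("name", "Alice");
        alice.put("city", "Paris");
        rows.put("row1", alice);
        Map<String, String> bob = new LinkedHashMap<>();
        bob.put("name", "Bob");
        bob.put("city", "Paris");
        rows.put("row2", bob);
        return new ReadOnlyGraphSketch(rows);
    }

    public static void main(String[] args) {
        // All triples whose predicate is "city" and whose object is "Paris".
        System.out.println(sample().find(null, "city", "Paris"));
    }
}
```

Because there are no add or delete operations, the view is read-only by construction; everything a query needs goes through find.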

ARQ query processing has six stages: parsing, algebra generation, execution building, high-level optimization, low-level optimization, and finally evaluation.

Parsing

Query string -> Query object. The Query class represents a query as an abstract syntax tree (AST) and provides methods for building that AST.

Algebra generation

The Query object is converted into a SPARQL algebra expression.

High-Level Optimization and Transformations

A series of transformations is applied to the algebra operations (Op). Users can add their own transformations:

Op op = ... ;
Transform someTransform = ... ;
op = Transformer.transform(someTransform, op) ;

The transformation processes the algebra bottom-up, so sub-operations are rewritten before the operation that contains them.
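The bottom-up order can be illustrated with a small self-contained sketch. It uses a toy algebra tree, not ARQ's Op, Transform, or Transformer classes; BottomUpTransformSketch, its Op type, and the dropUnitJoin rule are hypothetical names made up for this example. The rule removes a join with the unit table (the identity for join). Because children are rewritten first, a single pass is enough to collapse nested joins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Toy model (not ARQ's classes) of a bottom-up algebra rewrite. */
final class BottomUpTransformSketch {

    /** A node of a toy algebra tree: an operator name plus sub-operations. */
    static final class Op {
        final String name;
        final List<Op> children;
        Op(String name, List<Op> children) { this.name = name; this.children = children; }
        @Override public String toString() {
            return children.isEmpty() ? name : name + children;
        }
    }

    static Op op(String name, Op... children) { return new Op(name, List.of(children)); }

    /** Rewrites the children first, then applies the rule to the rebuilt parent. */
    static Op transform(UnaryOperator<Op> rule, Op op) {
        List<Op> rewritten = new ArrayList<>();
        for (Op child : op.children) rewritten.add(transform(rule, child));
        return rule.apply(new Op(op.name, rewritten));
    }

    /** Example rule: a join with the unit table is replaced by its other operand. */
    static Op dropUnitJoin(Op op) {
        if (!op.name.equals("join") || op.children.size() != 2) return op;
        Op left = op.children.get(0);
        Op right = op.children.get(1);
        if (left.name.equals("unit")) return right;
        if (right.name.equals("unit")) return left;
        return op;
    }

    /** filter(join(join(unit, bgp1), unit)) */
    static Op example() {
        return op("filter", op("join", op("join", op("unit"), op("bgp1")), op("unit")));
    }

    public static void main(String[] args) {
        System.out.println(example());
        System.out.println(transform(BottomUpTransformSketch::dropUnitJoin, example()));
    }
}
```

By the time the rule sees the outer join, the inner join has already been reduced to bgp1, so the outer join also matches the rule and disappears in the same pass.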

Low-Level Optimization and Evaluation

One responsibility of low-level optimization is choosing the order in which the parts of a query are executed.
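As a rough illustration of what "choosing the execution order" can mean, the sketch below reorders the triple patterns of a basic graph pattern so that the patterns with the fewest variables run first. This is only a stand-in heuristic chosen for the example, not ARQ's actual reordering logic, and PatternReorderSketch and its methods are hypothetical names. Triple patterns are modeled as plain String arrays, with variables written as "?name".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy model (not ARQ's optimizer): run the most selective-looking patterns first. */
final class PatternReorderSketch {

    /** Counts the terms of a triple pattern that are variables ("?x"). */
    static int variableCount(String[] triple) {
        int n = 0;
        for (String term : triple) {
            if (term.startsWith("?")) n++;
        }
        return n;
    }

    /** Returns a copy sorted by ascending variable count; ties keep their original order. */
    static List<String[]> reorder(List<String[]> patterns) {
        List<String[]> sorted = new ArrayList<>(patterns);
        Comparator<String[]> byVariables =
            Comparator.comparingInt(PatternReorderSketch::variableCount);
        sorted.sort(byVariables);
        return sorted;
    }

    /** A hypothetical three-pattern BGP, written in a deliberately poor order. */
    static List<String[]> example() {
        List<String[]> bgp = new ArrayList<>();
        bgp.add(new String[] {"?s", "?p", "?o"});
        bgp.add(new String[] {"?s", ":knows", "?o"});
        bgp.add(new String[] {"?s", ":name", "\"Alice\""});
        return bgp;
    }

    public static void main(String[] args) {
        for (String[] t : reorder(example())) {
            System.out.println(String.join(" ", t));
        }
    }
}
```

The idea is that a pattern with more fixed terms tends to match fewer triples, so evaluating it first keeps intermediate results small. A real engine would typically rely on better information than a variable count.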

Query Engines and Query Engine Factories

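Everything from building the algebra to executing the query is driven by QueryExecution.execSelect or one of the other exec operations on QueryExecution. While it runs, the QueryExecution creates a query engine.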

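A query engine executes in a top-down fashion. When the query execution factory is given a dataset and a query, it calls accept on each registered engine factory to find one that can execute the query. Once a factory has been chosen, its create method builds a Plan object, and the Plan provides the QueryIterator for the query.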
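The accept/create handshake described above can be sketched with a toy registry. None of the types below are Jena's: EngineRegistrySketch, EngineFactory, Plan, and Registry are hypothetical stand-ins, and queries and datasets are plain strings. Trying the most recently registered factory first is a choice made for this sketch so that a specialized engine can take precedence over a general one.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Toy model (not Jena's classes) of choosing an engine factory via accept/create. */
final class EngineRegistrySketch {

    /** Stand-in for a plan: it only hands out an iterator over results. */
    interface Plan {
        Iterator<String> iterator();
    }

    /** Stand-in for an engine factory. */
    interface EngineFactory {
        boolean accept(String query, String dataset);
        Plan create(String query, String dataset);
    }

    /** Registered factories; the newest registration is tried first (sketch's choice). */
    static final class Registry {
        private final List<EngineFactory> factories = new ArrayList<>();

        void add(EngineFactory f) { factories.add(0, f); }

        /** Returns the first factory that accepts the request, or null if none does. */
        EngineFactory find(String query, String dataset) {
            for (EngineFactory f : factories) {
                if (f.accept(query, dataset)) return f;
            }
            return null;
        }
    }

    /** A factory that accepts datasets whose name starts with the given prefix. */
    static EngineFactory factory(String label, String datasetPrefix) {
        return new EngineFactory() {
            @Override public boolean accept(String query, String dataset) {
                return dataset.startsWith(datasetPrefix);
            }
            @Override public Plan create(String query, String dataset) {
                return () -> List.of(label + " ran: " + query).iterator();
            }
        };
    }

    /** Registers a general and a specialized factory, then runs the query. */
    static String run(String query, String dataset) {
        Registry registry = new Registry();
        registry.add(factory("main", ""));            // accepts every dataset
        registry.add(factory("special", "special:")); // accepts only "special:" datasets
        EngineFactory chosen = registry.find(query, dataset);
        return chosen.create(query, dataset).iterator().next();
    }

    public static void main(String[] args) {
        System.out.println(run("SELECT *", "special:db"));
        System.out.println(run("SELECT *", "mem"));
    }
}
```

The general factory acts as a fallback: it accepts everything, so a request that no specialized factory claims still gets a plan.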

Main Query Engine

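The main query engine can execute any query. Once initialization is complete, it calls QC.execute to execute the query. Any extension that wants to reuse the main query engine with its own OpExecutor must call QC.execute to evaluate sub-queries. QC.execute creates an OpExecutor object and uses it to execute an algebra operation.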

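There are two ways to extend the main query engine:

  1. Stage generators, which execute basic graph patterns (BGPs) while reusing the rest of the engine.
  2. A custom OpExecutor, which executes specific algebra operators.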

Stage generator

The advantage of a StageGenerator over an OpExecutor is that it requires knowledge of fewer internals. It only needs to be set in the context: set(ARQ.stageGenerator, stageGenAlt).

By default, a BGP is handled by OpExecutor.execute(OpBGP, QueryIterator), which calls StageGenerator.execute(BasicPattern pattern, QueryIterator input, ExecutionContext execCxt). A custom StageGenerator can be installed as follows:

// Get the standard one.
StageGenerator orig = (StageGenerator)ARQ.getContext().get(ARQ.stageGenerator) ;
// Create a new one
StageGenerator myStageGenerator= new MyStageGenerator(orig) ;
// Register it
StageBuilder.setGenerator(ARQ.getContext(), myStageGenerator) ;

However, in the 2.12 source code that I looked at, OpExecutor.execute looks like this:

protected QueryIterator execute(OpBGP opBGP, QueryIterator input) {
    BasicPattern pattern = opBGP.getPattern() ;
    QueryIterator qIter = stageGenerator.execute(pattern, input, execCxt) ;
    if (hideBNodeVars)
        qIter = new QueryIterDistinguishedVars(qIter, execCxt) ;
    return qIter ;
}

So setting the stageGenerator on the context is enough:

try (QueryExecution qexec = QueryExecutionFactory.create(query, dataset.toDataset())) {
    // Wrap the StageGenerator currently set on this execution's context.
    StageGenerator origStageGen = (StageGenerator) qexec.getContext().get(ARQ.stageGenerator);
    StageGenerator stageGenAlt = new StageGeneratorKG(origStageGen, dataset.getKGSchema());
    qexec.getContext().set(ARQ.stageGenerator, stageGenAlt);
    // ... execute the query with qexec ...
}

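For a custom StageGenerator, the recommended approach is to build a chain: if the active graph is not an instance of your own graph class, pass the request on to the StageGenerator above it in the chain.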

public class MyStageGenerator implements StageGenerator
{
    StageGenerator above = null ;

    public MyStageGenerator (StageGenerator original)
    { above = original ; }

    @Override
    public QueryIterator execute(BasicPattern pattern, QueryIterator input, ExecutionContext execCxt)
    {
        Graph g = execCxt.getActiveGraph() ;
        // Test to see if this is a graph we support.
        if ( ! ( g instanceof MySpecialGraphClass ) )
            // Not us - bounce up the StageGenerator chain
            return above.execute(pattern, input, execCxt) ;
        MySpecialGraphClass graph = (MySpecialGraphClass )g ;
        // Create a QueryIterator for this request
        ...
    }
}

OpExecutor

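If the operators you need to extend go beyond BGPs, you have to extend OpExecutor yourself. The usual steps are:

  1. Subclass the existing OpExecutor and override the relevant QueryIterator execute(...) methods.
  2. Register an OpExecutorFactory, either through QC or on an ExecutionContext.
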
// 1. Register through QC
OpExecutorFactory customExecutorFactory = new MyOpExecutorFactory(...) ;
QC.setFactory(ARQ.getContext(), customExecutorFactory) ;

// 2. Register on an ExecutionContext
// Execute an operation with a different OpExecutor factory.
// New context.
ExecutionContext ec2 = new ExecutionContext(execCxt) ;
ec2.setExecutor(plainFactory) ;
QueryIterator qIter = QC.execute(op, input, ec2) ;

// The plainFactory used above.
private static OpExecutorFactory plainFactory =
    new OpExecutorFactory()
    {
        @Override
        public OpExecutor create(ExecutionContext execCxt)
        {
            // The default OpExecutor of ARQ.
            return new OpExecutor(execCxt) ;
        }
    } ;

Differences between Dataset, Model, and Graph

Jena is divided into an API, for application developers, and an SPI for systems developers, such as people making storage engines, reasoners etc.

Dataset, Model, Statement, Resource and Literal are API interfaces and provide many conveniences for application developers.

DatasetGraph, Graph, Triple, Node are SPI interfaces. They’re pretty spartan and simple to implement (as you’d hope if you’ve got to implement the things).

  • A DataSource is a collection of models (one being the Default Model, any others being Named Models) that you expect will have new triples added to it over time. You can read and write on DataSources.
  • A Dataset is like a DataSource, but its triples are static - you don’t expect new ones to be added or existing ones to be deleted. These guys are read-only.
  • A Model is a collection of statements- this is what you typically aim your SPARQL queries at. If you SPARQL a DataSource or Dataset and don’t use a ‘FROM NAMED’ clause, you’re querying the Default Model.
  • A Graph is a collection of triples. Every Model can be turned into a Graph, to provide a somewhat closer representation of the RDF, OWL, and SPARQL standards.
  • A DatasetGraph is a container for Graphs, similar to a DataSource (i.e read/write), that provides the infrastructure for Default and Named Graphs

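This article is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license. Reposting is welcome, provided that you credit http://thousandhu.github.io as the source and keep the article's content intact. The author reserves all copyright-related rights.

Permalink: http://thousandhu.github.io/2016/04/20/apache-jena-arq-模块扩展/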