Overview of the QLever codebase

This page provides an overview of the large QLever codebase, as a starting point for developers joining the project. It is currently still a stub, but it's a start.

Turtle parsing and tokenization

QLever's current Turtle parser is hand-written, unlike the SPARQL parser, which is generated with ANTLR. The main code is in src/parser/RdfParser.{h,cpp}. It uses functions like bool TurtleParser::iriref(), each of which tries to parse one piece of the grammar (an IRIREF in this case) from the current input. Such a function returns true if it succeeds, in which case lastParseResult_ is updated and the parsed part is removed from the input, and false otherwise. The classes are:

template <class Tokenizer> class TurtleParser : public RdfParserBase
template <class Tokenizer> class RdfMultifileParser : public RdfParserBase
template <class Tokenizer> class NQuadParser : public TurtleParser<Tokenizer>
template <typename Parser> class RdfStringParser : public Parser  
template <typename Parser> class RdfStreamParser : public Parser 
template <typename Parser> class RdfParallelParser : public Parser
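
The try-to-parse convention used by functions like iriref() can be illustrated with a minimal sketch. The class MiniParser and its accessors are hypothetical and heavily simplified (for example, it accepts any characters between < and >, which the real grammar does not); only the names iriref and lastParseResult_ are taken from the description above.

```cpp
#include <string>
#include <string_view>

// Hypothetical, simplified stand-in for a hand-written parser.
// This is NOT QLever's code; it only mirrors the convention described above.
class MiniParser {
 public:
  explicit MiniParser(std::string_view input) : input_{input} {}

  // Try to parse an IRIREF such as <http://example.org/x> from the start of
  // the remaining input.
  // On success: store the match in lastParseResult_, remove it from the
  // input, and return true.
  // On failure: leave the input and lastParseResult_ untouched, return false.
  bool iriref() {
    if (input_.empty() || input_.front() != '<') {
      return false;
    }
    auto end = input_.find('>');
    if (end == std::string_view::npos) {
      return false;
    }
    lastParseResult_ = std::string{input_.substr(0, end + 1)};
    input_.remove_prefix(end + 1);
    return true;
  }

  const std::string& lastParseResult() const { return lastParseResult_; }
  std::string_view remainingInput() const { return input_; }

 private:
  std::string_view input_;        // the part of the input not yet consumed
  std::string lastParseResult_;   // result of the last successful parse
};
```

Because a failed call leaves the input unchanged, a caller can try one grammar alternative after another without any explicit backtracking.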

Note that TurtleParser sets UseRelaxedParsing to true if and only if Tokenizer == TokenizerCtre, which in turn is the case if and only if ascii-prefixes-only is true in the settings.json file.
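
The layering of the class templates listed above, and the way a compile-time flag can depend on the tokenizer type, can be sketched as follows. All Toy* names and the members input_, setInput and input are hypothetical; the sketch assumes the flag is derived with std::is_same_v, which may differ from how the actual code computes it.

```cpp
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical tokenizer tags; they stand in for the real tokenizer classes.
struct ToyTokenizer {};
struct ToyTokenizerCtre {};

// Parser parameterized by its tokenizer. The flag is fixed at compile time
// and is true exactly when the tokenizer is ToyTokenizerCtre.
template <class Tokenizer>
class ToyTurtleParser {
 public:
  static constexpr bool UseRelaxedParsing =
      std::is_same_v<Tokenizer, ToyTokenizerCtre>;

 protected:
  std::string input_;  // input shared with the wrapper classes
};

// Wrapper that inherits from the parser it is instantiated with and adds a
// way to feed the input, here from a single string.
template <typename Parser>
class ToyStringParser : public Parser {
 public:
  void setInput(std::string input) { this->input_ = std::move(input); }
  const std::string& input() const { return this->input_; }
};

using StrictParser = ToyStringParser<ToyTurtleParser<ToyTokenizer>>;
using RelaxedParser = ToyStringParser<ToyTurtleParser<ToyTokenizerCtre>>;
```

With this layout, the grammar logic lives in one class template, while the wrappers only differ in how they obtain their input.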

The code for tokenization is in src/parser/Tokenizer.{h,cpp}. The struct TurtleToken holds the regexes for the various tokens, e.g. Dot = grp("\\."), where grp puts (...) around its argument (and cls puts [...] around its argument).

Another example is PnLocal = grp(PnLocalString), where PnLocalString is ([%BASE%_:0-9]|%[0-9A-Fa-f]{2}|\\[_~.\-!$&'()*+,;=/?#@%])(\.*([%BASE%_\-0-9\x{00B7}\x{0300}-\x{036F}\x{203F}-\x{2040}:]|%[0-9A-Fa-f]{2}|\\[_~.\-!$&'()*+,;=/?#@%]))*
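
The helpers grp and cls can be pictured with the following sketch. It uses std::string and std::regex for simplicity, which is an assumption made for illustration; the actual definitions in Tokenizer.h may be implemented differently. The function isDot is hypothetical.

```cpp
#include <regex>
#include <string>

// grp puts (...) around its argument, turning it into a regex group.
std::string grp(const std::string& s) { return "(" + s + ")"; }

// cls puts [...] around its argument, turning it into a character class.
std::string cls(const std::string& s) { return "[" + s + "]"; }

// Hypothetical helper: checks whether s is exactly one literal dot, using
// the regex built as in the Dot example above.
bool isDot(const std::string& s) {
  static const std::regex dot{grp("\\.")};
  return std::regex_match(s, dot);
}
```

Building the token regexes from small string helpers like these keeps long expressions such as PnLocalString composable from named parts.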
