Overview over the QLever codebase

Hannah Bast edited this page Dec 17, 2024 · 1 revision

This page provides an overview of the large QLever codebase, as an entry point for developers joining the project. It is currently still a stub, but it's a start.

Turtle parsing and tokenization

QLever's current Turtle parser is hand-written, unlike the SPARQL parser, which is generated with ANTLR. The main code is in src/parser/RdfParser.{h,cpp}. It uses functions like bool TurtleParser::iriref(), each of which tries to parse one piece of the grammar (an IRIREF in this case) from the current input. Such a function returns true if it succeeds (in which case lastParseResult_ is updated and the parsed part is removed from the input) and false otherwise. The classes are:

template <class Tokenizer> class TurtleParser : public RdfParserBase
template <class Tokenizer> class RdfMultifileParser : public RdfParserBase
template <class Tokenizer> class NQuadParser : public TurtleParser<Tokenizer>
template <typename Parser> class RdfStringParser : public Parser  
template <typename Parser> class RdfStreamParser : public Parser 
template <typename Parser> class RdfParallelParser : public Parser 
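The "try to parse and report success" convention and the wrapper-by-inheritance layout of the classes above can be illustrated with a small toy. This is only a sketch and not QLever's code: ToyParserBase, ToyTurtleParser, ToyStringParser, setInput, remainingInput, skipWhitespace and demoTryParse are hypothetical names, the IRIREF check is heavily simplified compared to the Turtle grammar, and leaving the input untouched on failure is a choice made for this toy.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical stand-in for the role of RdfParserBase (not QLever's class).
class ToyParserBase {
 public:
  virtual ~ToyParserBase() = default;
};

// Toy parser: each grammar function tries to parse one piece of the grammar
// from the front of the input and reports whether it succeeded.
class ToyTurtleParser : public ToyParserBase {
 public:
  void setInput(std::string_view input) { input_ = input; }
  std::string_view remainingInput() const { return input_; }
  const std::string& lastParseResult() const { return lastParseResult_; }

  // Simplified IRIREF: '<', then no whitespace, up to the first '>'.
  // On success, store the match and remove it from the input.
  // On failure, leave input and last result untouched (toy behavior).
  bool iriref() {
    if (input_.empty() || input_.front() != '<') return false;
    const auto end = input_.find('>');
    if (end == std::string_view::npos) return false;
    const std::string_view candidate = input_.substr(0, end + 1);
    if (candidate.find_first_of(" \t\r\n") != std::string_view::npos) {
      return false;
    }
    lastParseResult_ = std::string(candidate);
    input_.remove_prefix(end + 1);
    return true;
  }

  // Parse a single '.' with the same convention.
  bool dot() {
    if (input_.empty() || input_.front() != '.') return false;
    lastParseResult_ = ".";
    input_.remove_prefix(1);
    return true;
  }

  void skipWhitespace() {
    while (!input_.empty() &&
           std::isspace(static_cast<unsigned char>(input_.front()))) {
      input_.remove_prefix(1);
    }
  }

 protected:
  std::string_view input_;
  std::string lastParseResult_;
};

// Wrapper in the style of `template <typename Parser> class X : public Parser`:
// it inherits the grammar functions and only adds an input source, here an
// owned string.
template <typename Parser>
class ToyStringParser : public Parser {
 public:
  explicit ToyStringParser(std::string input) : storage_(std::move(input)) {
    this->setInput(storage_);
  }
  // The base class holds a view into storage_, so copying would dangle.
  ToyStringParser(const ToyStringParser&) = delete;
  ToyStringParser& operator=(const ToyStringParser&) = delete;

 private:
  std::string storage_;
};

// Small usage demo: call from main().
inline void demoTryParse() {
  ToyStringParser<ToyTurtleParser> parser{"<http://example.org/s> ."};
  assert(parser.iriref());   // succeeds and consumes the IRI
  assert(!parser.iriref());  // next token is not an IRI, nothing is consumed
  parser.skipWhitespace();
  assert(parser.dot());
}
```

Because the wrapper derives from its template argument, the same grammar functions can be combined with different input sources without duplicating them, which is presumably the motivation for the string, stream and parallel variants listed above.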

Note that TurtleParser sets UseRelaxedParsing to true if and only if Tokenizer is TokenizerCtre, which in turn is the case if and only if ascii-prefixes-only is set to true in the settings.json file.
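One way such a coupling between a tokenizer type and a parsing flag can be expressed in C++ is a compile-time constant derived from the template argument. The following is a minimal sketch under that assumption and does not claim to mirror how QLever implements it: ToyTokenizer, ToyTokenizerCtre, ToyFlagParser, relaxedParsingFor and demoFlags are hypothetical names.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-ins for the two tokenizer types; only their identity
// matters for this sketch.
struct ToyTokenizer {};
struct ToyTokenizerCtre {};

template <class Tokenizer>
class ToyFlagParser {
 public:
  // True exactly when the parser is instantiated with the CTRE-style tokenizer.
  static constexpr bool UseRelaxedParsing =
      std::is_same_v<Tokenizer, ToyTokenizerCtre>;
};

static_assert(ToyFlagParser<ToyTokenizerCtre>::UseRelaxedParsing);
static_assert(!ToyFlagParser<ToyTokenizer>::UseRelaxedParsing);

// Assumed mapping from a runtime setting to the tokenizer type: the setting
// selects the instantiation, and the flag follows from the type.
inline bool relaxedParsingFor(bool asciiPrefixesOnly) {
  return asciiPrefixesOnly ? ToyFlagParser<ToyTokenizerCtre>::UseRelaxedParsing
                           : ToyFlagParser<ToyTokenizer>::UseRelaxedParsing;
}

// Small usage demo: call from main().
inline void demoFlags() {
  assert(relaxedParsingFor(true));
  assert(!relaxedParsingFor(false));
}
```

With this layout the flag cannot get out of sync with the tokenizer choice, because it is never set separately.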

The code for tokenization is in src/parser/Tokenizer.{h,cpp}. The struct TurtleToken holds the regexes for the various tokens, e.g. Dot = grp("\\."), where grp puts (...) around its argument (and cls puts [...] around its argument).
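The two helpers are easy to picture as plain string builders. The sketch below assumes exactly the behavior described above (grp wraps in parentheses, cls wraps in brackets) and uses std::regex only as a stand-in engine to show a token being matched at the start of the input. ToyTurtleToken, its Digit and Integer members, matchPrefix and demoTokens are hypothetical and not taken from Tokenizer.h.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Wrap a regex fragment in a group: grp("a") yields "(a)".
inline std::string grp(const std::string& s) { return "(" + s + ")"; }

// Wrap a regex fragment in a character class: cls("0-9") yields "[0-9]".
inline std::string cls(const std::string& s) { return "[" + s + "]"; }

// Toy counterpart of a token table. Only Dot is taken from the text above;
// Digit and Integer are illustrative additions.
struct ToyTurtleToken {
  const std::string Dot = grp("\\.");
  const std::string Digit = cls("0-9");
  const std::string Integer = grp("[+-]?" + Digit + "+");
};

// Return the length of the match of `pattern` anchored at the start of
// `input`, or 0 if the pattern does not match there.
inline std::size_t matchPrefix(const std::string& pattern,
                               const std::string& input) {
  const std::regex re(pattern);
  std::smatch m;
  if (std::regex_search(input, m, re,
                        std::regex_constants::match_continuous)) {
    return static_cast<std::size_t>(m.length(0));
  }
  return 0;
}

// Small usage demo: call from main().
inline void demoTokens() {
  const ToyTurtleToken tokens;
  assert(tokens.Dot == "(\\.)");
  assert(matchPrefix(tokens.Dot, ". <s> <p> <o>") == 1);
  assert(matchPrefix(tokens.Integer, "-42 .") == 3);
}
```

Building the patterns from small named fragments keeps long token definitions readable, since each fragment can be reused and checked on its own.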

Another example is PnLocal = grp(PnLocalString), where PnLocalString is ([%BASE%_:0-9]|%[0-9A-Fa-f]{2}|\\[_~.\-!$&'()*+,;=/?#@%])(\.*([%BASE%_\-0-9\x{00B7}\x{0300}-\x{036F}\x{203F}-\x{2040}:]|%[0-9A-Fa-f]{2}|\\[_~.\-!$&'()*+,;=/?#@%]))*