obartunov/dict_regex
Regular expressions dictionary for PostgreSQL
=============================================

This code is a dictionary for PostgreSQL full-text search which combines
the generic synonym and thesaurus dictionaries with the power of regular
expressions.

* Configuration

The dictionary accepts one mandatory parameter, RULES, which has to be
set to the file describing parsing rules and output recipes. Its syntax
is very simple. Each non-empty line is either a comment (when it starts
with the '#' symbol) or a conversion rule; in the latter case it has to
contain a whitespace-separated pattern and output recipe.

Currently, a pattern has to match a whole number (one, two or more) of
input tokens; they will be replaced with a single one. Any run of
whitespace between input words is treated as a single space. If the
output recipe is empty, the match is considered to be a stop-word and is
not processed by the dictionaries next in line.

The dictionary itself encloses the pattern in a "^...$" construct, so you
need not include these anchors. If several rules match the input stream,
the longest match is accepted; if several matches have the same length,
the rule that appears earliest in the rules file wins.

The pattern syntax is basically that of PCRE, and is compatible with
Perl. The one exception is the following (related to PCRE working in
partial matching mode): repeated single characters such as "a{2,4}" and
repeated single metasequences such as "\d+" are not permitted if the
maximum number of occurrences is greater than one. Optional items such as
"\d?" (where the maximum is one) are permitted. Quantifiers with any
values are permitted after parentheses, so the invalid examples above can
be coded as "(a){2,4}" and "(\d)+", respectively.

Also, for now, only 9 matched substrings are supported ($1 - $9). They
are translated to lower case along with the rest of the output recipe.

If the output recipe contains several words, they are returned as a set
of tokens with the same position, i.e. they are treated as synonyms.

* Usage

1. Compile and install it. The compilation requires PCRE
   (http://pcre.org) headers and library in compiler-accessible paths;
   if necessary, tweak the Makefile according to your settings

2. Load the dictionary definitions into your database

    psql qq < dict_regex.sql

3. Create the conversion rules, and point the dictionary to the file

    qq# ALTER TEXT SEARCH DICTIONARY regex (RULES = '@your_rules_file@');
    ALTER TEXT SEARCH DICTIONARY

4. Use it as you wish

* Examples

Some basic example rules:

    # Synonym expansion - 'catalogue' will be indexed both as is and as 'catalog'
    # on the query side, it means 'catalogue' -> 'catalogue' & 'catalog'
    catalogue catalogue catalog

    # Thesaurus - index 'regular expression' as single token
    regular\sexpressions? regex

    # Complex behaviour - transpose parts of an 8-digit number, 12345678 -> 56781234
    (\d\d\d\d)(\d\d\d\d) $2$1

* Todo

- Improve handling of config file - allow whitespace in patterns and in
  recipes
- Improve recipe parsing - allow more than 9 matches per pattern
- Improve performance of pattern matching - do not reparse already
  processed parts of incoming string
- Think about integration with other dictionaries (like thesaurus does)
  for normalization before pattern matching
- Implement complex expansion of 'synonyms' - i.e. rule options to
  control complex AND and OR behaviour of returned lexemes.
- Replace (or optionally replace) PCRE with PostgreSQL built-in regular
  expressions engine

* Security notice

Be warned that dict_regex is insecure - it does not perform any special
checks on the location and content of the rule file. So, it may be
pointed to any file readable by postmaster, and it is possible for a user
to reconstruct the file contents. Do not use it in security-critical
environments, or modify the source accordingly.
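
* Rule semantics sketch

The rule-file behaviour described under Configuration can be illustrated
with a small Python model. This is only a sketch, not the dictionary's
actual C implementation: it uses Python's "re" module instead of PCRE (so
the partial-matching quantifier restriction is not modelled), the function
names parse_rules, expand and lexize are hypothetical, and the stop-word
rule "the" in the sample is an invented illustration of an empty recipe.

```python
import re

# Sample rules: the three rules from the Examples section plus a
# hypothetical stop-word rule ("the" with an empty recipe).
SAMPLE_RULES = r"""
# comment lines and blank lines are ignored
catalogue catalogue catalog
regular\sexpressions? regex
(\d\d\d\d)(\d\d\d\d) $2$1
the
"""

def parse_rules(text):
    """Parse rule-file text into (compiled_pattern, recipe_words) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        parts = line.split()
        # First field is the pattern; the remaining fields form the recipe.
        # An empty recipe marks the match as a stop-word.
        rules.append((re.compile(parts[0]), parts[1:]))
    return rules

def expand(recipe_word, match):
    """Substitute $1..$9 with captured substrings, then lower-case."""
    def group(m):
        idx = int(m.group(1))
        if idx > match.re.groups:
            return ""  # reference to a group the pattern does not have
        return match.group(idx) or ""
    return re.sub(r"\$([1-9])", group, recipe_word).lower()

def lexize(rules, tokens):
    """Match the longest run of tokens starting at tokens[0].

    Returns (tokens_consumed, lexemes), or None when no rule matches.
    """
    for n in range(len(tokens), 0, -1):    # prefer the longest match
        candidate = " ".join(tokens[:n])   # whitespace collapsed to one space
        for pattern, recipe in rules:      # earlier rules win ties
            m = pattern.fullmatch(candidate)  # models the implicit ^...$
            if m:
                return n, [expand(word, m) for word in recipe]
    return None

if __name__ == "__main__":
    rules = parse_rules(SAMPLE_RULES)
    print(lexize(rules, ["catalogue"]))
    print(lexize(rules, ["regular", "expressions", "rock"]))
    print(lexize(rules, ["12345678"]))
    print(lexize(rules, ["the"]))
```

In this model a multi-word recipe yields several lexemes for one match
(the synonym case), a multi-token pattern consumes more than one input
token (the thesaurus case), and an empty lexeme list stands for a
stop-word.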