I just finished releasing the wikimedia extra
https://github.com/wikimedia/search-extra Elasticsearch plugin which
contains support for trigram accelerated regular expressions similar
It's far from perfect and is going to be less efficient than normal full text
search, but if you absolutely need to search against arbitrary stuff it
gets the job done reasonably fast.
You can try it on our beta site
Leading repetitions cause trouble for the highlighter at this point, but
that is a problem with another plugin.
Define reasonably fast:
Running a regular expression across all ~200,000 articles, templates, talk
pages, etc. in Simple Wikipedia in our (not very fast) beta environment
takes around 15 seconds. Using this plugin it takes around 1 second.
Full disclosure: I'm not very good at this kind of thing so I'm sure the
algorithms aren't as efficient as they could be but it gets the job done.
How it works:
1. Compiles the regex to an automaton using Lucene's RegExp class.
2. Uses the process described in the pdf linked above as "PostgreSQL's
   Implementation" to convert the regex into an ngram automaton.
3. Breaks cycles in the automaton which would break step 4.
4. Converts the automaton to an expression language.
5. Simplifies the expression.
6. Converts the expression into a Lucene filter (usually a boolean filter
   containing term filters).
7. Uses the filter as a first pass.
8. Rechecks the regex.
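
To make the last two steps (the first-pass filter, then the recheck) concrete, here is a toy Python sketch. It is not the plugin's code, which is Java on top of Lucene. The in-memory index, the function names, and the hand-written trigram expression for the example regex are all assumptions made for illustration; the plugin derives the expression from the automaton rather than by hand.

```python
import re
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs):
    """Map each trigram to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def candidates(expr, index, all_ids):
    """Evaluate a boolean trigram expression against the index.

    expr is either a trigram string or a tuple ("and", e1, e2, ...)
    or ("or", e1, e2, ...). This plays the role of the boolean filter.
    """
    if isinstance(expr, str):
        return index.get(expr, set())
    op, *children = expr
    sets = [candidates(child, index, all_ids) for child in children]
    if op == "and":
        result = set(all_ids)
        for s in sets:
            result &= s
        return result
    result = set()
    for s in sets:
        result |= s
    return result


def regex_search(pattern, expr, docs, index):
    """First pass with the trigram filter, then recheck the real regex."""
    first_pass = candidates(expr, index, docs.keys())
    regex = re.compile(pattern)
    return sorted(d for d in first_pass if regex.search(docs[d]))


# Hypothetical sample data.
DOCS = {
    1: "the quick brown fox",
    2: "quiet brook",
    3: "a quick brook trout",
    4: "quibro",
    5: "brick brown quick",
}
PATTERN = r"qui(ck|et) bro"
# Trigrams that any match of PATTERN must contain, written out by hand.
EXPR = ("and", "qui",
        ("or",
         ("and", "uic", "ick", "ck ", "k b"),
         ("and", "uie", "iet", "et ", "t b")),
        " br", "bro")

INDEX = build_index(DOCS)
print(candidates(EXPR, INDEX, DOCS.keys()))     # filter output, may hold false positives
print(regex_search(PATTERN, EXPR, DOCS, INDEX))  # after the recheck
```

Document 5 contains every required trigram but not in the right arrangement, so it survives the filter and is only thrown out by the recheck. That is why the recheck step cannot be skipped.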
More work to do:
Right now the plugin is very aggressive about checking all the terms it can
extract from the regex. It'd likely be faster to ignore terms that aren't
very selective and/or execute them in order of most to least selective,
stopping early if the number of candidates dips below a certain point.
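
A rough Python sketch of that idea, for the simple case where the extracted terms are all ANDed together. Nothing like this is in the plugin yet; the function name, the max_fraction and stop_below parameters, and the dict-of-sets index are all hypothetical.

```python
def selective_intersection(grams, index, total_docs,
                           max_fraction=0.5, stop_below=10):
    """Intersect posting sets, most selective first, with an early stop.

    index maps a trigram to the set of document ids containing it.
    Returns (candidates, terms_used). candidates is None when no term
    is selective enough, meaning the caller has to scan everything.
    """
    postings = [index.get(g, set()) for g in grams]
    # Ignore terms that match more than max_fraction of all documents.
    useful = [p for p in postings if len(p) <= max_fraction * total_docs]
    if not useful:
        return None, 0
    useful.sort(key=len)  # smallest posting set = most selective term
    result = set(useful[0])
    used = 1
    for p in useful[1:]:
        if len(result) < stop_below:
            break  # few enough candidates, let the regex recheck do the rest
        result &= p
        used += 1
    return result, used


# Hypothetical index over 10 documents.
INDEX = {
    "abc": {1, 2, 3, 4, 5, 6, 7, 8},
    "bcd": {1, 2},
    "cde": {1, 2, 3},
    "the": set(range(1, 11)),
}
print(selective_intersection(["abc", "bcd", "cde", "the"], INDEX, 10, 0.8, 3))
```

Dropping terms from an AND can only widen the candidate set, so the result is still a superset of the true matches and the recheck keeps the final answer correct.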
This targets the 1.3 branch of Elasticsearch.