[ANN] Released trigram accelerated regex queries for Elasticsearch version 0.0.1

I just finished releasing the wikimedia extra
https://github.com/wikimedia/search-extra Elasticsearch plugin which
contains support for trigram accelerated regular expressions similar
to PostgreSQL's
implementation
http://wiki.postgresql.org/images/6/6c/Index_support_for_regular_expression_search.pdf.
Its far from perfect and going to be less efficient the normal full text
search but if you absolutely need to search against arbitrary stuff it
gets the job done reasonably fast.

You can try it on our beta site
http://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special%3ASearch&profile=default&search=insource%3A%2Fattend(ee)%3F%2F&fulltext=Search.
Leading repetitions cause trouble for the highlighter at this point but
that is a problem with another plugin
https://github.com/wikimedia/search-highlighter/issues/7
https://github.com/wikimedia/search-highlighter/issues/7.

Define reasonably fast:
Running a regular expression across all ~200,000 articles, templates, talk
pages, etc in simple wikipedia in our (not very fast beta) environment
takes around 15 seconds. Using this plugin it takes around 1 second.

Full disclosure: I'm not very good at this kind of thing so I'm sure the
algorithms aren't as efficient as they could be but it gets the job done.

How it works:

  1. Compiles the regex to an automaton using Lucene's RegExp class.
  2. Uses the process described in the pdf linked above as "PostgreSQL's
    Implementation" to convert the regex into a ngram automaton.
  3. Breaks cycles in the automaton which would break step 4.
  4. Converts the automaton to an expression language.
  5. Simplifies the expression.
  6. Converts the expression into a Lucene filter (usually a boolean filter
    containing term filters).
  7. Uses the filter as a first pass.
  8. Rechecks the regex.

More work to do:
Right now the plugin is very aggressive about checking all the terms it can
extract from the regex. It'd likely be faster to ignore terms that aren't
very selective and/or execute them in order of must to least selective
stopping early if the number of candidates dips below a certain point.
Maybe.

This targets the 1.3 branch of Elasticsearch.

Nik

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAPmjWd2D1Xc-G6Q3_g2RPP_SxZLSK9XZn1ZiZDrLF2VuOxxFWg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.