Trigram-accelerated regex searches

nik9000 · September 29, 2014, 6:09pm

On Thu, May 22, 2014 at 4:31 PM, Erik Rose grincheroo@gmail.com wrote:

Alright, try this on for size.

Since the built-in regex-ish filters want to be all clever and
index-based, why not use the JS script plugin, which is happy to run as a
post-processing phase?
curl -s -XGET 'http://127.0.0.1:9200/dxr_test/line/_search?pretty' -d
'{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"and": [
{
"query": {
"match_phrase": {
"content_trg": "Children"
}
}
},
{
"query": {
"match_phrase": {
"content_trg": "Next"
}
}
},
{
"script": {
"lang": "js",
"script": "(new
RegExp(pattern)).test(doc["content"].value)",
"params": {
"pattern": "Children.*Next"
}
}
}
]
}
}
}
}'

That gets me through the whole 16M-doc corpus in 117ms. (Without the
match_phrase queries, it takes forever, at 12s, so you can see the trigrams
acceleration working.) I am ecstatic.

Some of you might note that the pattern doesn't begin or end with a
wildcard; that's because RegExp.test() serves as a search rather than a
match, so wildcards are effectively assumed.

Cheers!

Sorry to revive a super old thread but its pretty relevant. I (sometimes)
need the super duper exact matching that you can get out of running regular
expressions against the source similarly to the proposal above. I
currently have it implemented as a post filter as well. The problem is
that its too easy to put together a query that has to visit millions of
documents and even if you always write fast queries you can get in line
behind users with slow queries. So the feature is pretty broken as is. So
I took a stab at writing a trigram accelerated regex plugin similar to
pg_trgm but, well, for Elasticsearch. Its certainly not the most efficient
thing in the world, but its way way more efficient (in the worst case) then
post filtering.

Right now its just a proposal that lives here:
https://gerrit.wikimedia.org/r/#/c/163268 . My computer science is pretty
rusty so I'm sure I've made some mistakes with the automaton and logic work
but it does work - even for pretty complex regexes - but that is certainly
an area for future improvement. Another area is in only executing some of
the filters - right now all the logic from the regex is extracted into a
tree of bool filters and then they are all executed. You could save time
by executing the term filters iteratively until there are only some number
of documents to recheck and then just recheck them.

Anyway, this is a start. Its not meant to replace real language analysis
but if you need exact matches like those from regular expressions with
a reasonably degree of efficiency.

Please have a look at the code (https://gerrit.wikimedia.org/r/#/c/163268)
and comment on it. At some point I'll release it as an Elasticsearch
plugin.

Thanks for reading,

Nik

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAPmjWd125WSuh%2BXicrjRUU9xhCYaj%2BhHoZKQPWpECXJo5QXPog%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.