Trigram-accelerated regex searches

I'm trying to move Mozilla's source code search engine (dxr.mozilla.org)
from a custom-written SQLite trigram index to ES. In the current production
incarnation, we support fast regex (and, by extension, wildcard) searches
by extracting trigrams from the search pattern and paring down the
documents to those containing said trigrams. That gives us a manageable
number of docs to then run the actual regex against. It performs very well.
For example, to match the regex .*[cr]attle, we filter the corpus down to
docs containing (("rat" OR "cat") AND "att" AND "ttl" AND "tle") and then
run the full regex against those candidates.
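
For the curious, here is a toy sketch of that extraction in Python. It is
not the actual DXR code, and it only handles the easy case where the
pattern expands to a handful of equal-length literals, but it shows the
shape of the boolean expression we build:

def trigrams(s):
    """All overlapping 3-character windows of a literal string."""
    return [s[i:i + 3] for i in range(len(s) - 2)]

def trigram_clauses(literals):
    """Build the AND-of-ORs over the trigrams at each position, e.g. for
    .*[cr]attle expanded to ["cattle", "rattle"]."""
    per_position = zip(*(trigrams(lit) for lit in literals))
    return [sorted(set(group)) for group in per_position]

print(trigram_clauses(["cattle", "rattle"]))
# [['cat', 'rat'], ['att'], ['ttl'], ['tle']]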

It seems like ES should be able to do this handily, but adding the trigram
filters doesn't make it any faster. First, here's the naive approach, using
an unaccelerated wildcard query. This should have to scan the world:

# Brute-force wildcard search
curl -s -XGET 'http://127.0.0.1:9200/dxr_test/line/_search?pretty' -d '{
    "query": {
        "constant_score": {
            "query": {
                "wildcard": {"content": "*Children*Next*"}
            }
        }
    }
}'

Then, we add trigram filters in an attempt to accelerate matters. The
filters alone take only 80ms to run and return 100 docs, each about 100
chars long, so I would expect running a wildcard query over those to be
scarcely noticeable. However, this query still takes 500ms, the same as the
above:

curl -s -XGET 'http://127.0.0.1:9200/dxr_test/line/_search?pretty' -d '{
    "query": {
        "filtered": {
            "query": {
                "constant_score": {
                    "query": {
                        "wildcard": {"content": "*Children*Next*"}
                    }
                }
            },
            "filter": {
                "and": [
                    {
                        "query": {
                            "match_phrase": {
                                "content_trg": "Children"
                            }
                        }
                    },
                    {
                        "query": {
                            "match_phrase": {
                                "content_trg": "Next"
                            }
                        }
                    }
                ]
            }
        }
    }
}'

(By using match_phrase, I'm counting on the query analyzer to break up
"Children" and "Next" into trigrams and bang them against the trigram
index, using the stored term positions to further pare down the found docs.
Judging by speed, it appears to work very well.)
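
To spell out the behavior I'm relying on (this is just an illustration,
not DXR or ES code): the analyzer should turn a query term into
position-ordered trigrams, and the phrase query then requires them at
consecutive positions in content_trg:

def phrase_trigrams(term):
    """Position-ordered trigrams for a query term."""
    t = term.lower()
    return [(pos, t[pos:pos + 3]) for pos in range(len(t) - 2)]

print(phrase_trigrams("Children"))
# [(0, 'chi'), (1, 'hil'), (2, 'ild'), (3, 'ldr'), (4, 'dre'), (5, 'ren')]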

Then I thought, "Perhaps ES needs a hint to run the wildcard query last."
But no, using post_filter is just as slow:

curl -s -XGET 'http://127.0.0.1:9200/dxr_test/line/_search?pretty' -d '{
    "query": {
        "filtered": {
            "query": {
                "match_all": {}
            },
            "filter": {
                "and": [
                    {
                        "query": {
                            "match_phrase": {
                                "content_trg": "Children"
                            }
                        }
                    },
                    {
                        "query": {
                            "match_phrase": {
                                "content_trg": "Next"
                            }
                        }
                    }
                ]
            }
        }
    },
    "post_filter": {
        "query": {
            "wildcard": {"content": "*Children*Next*"}
        }
    }
}'

Is it possible to coax ES into doing trigram-accelerated wildcard or regex
searching? What am I missing?

Here is the relevant part of my mapping:

"settings": {
    "analysis": {
        "analyzer": {
            # A lowercase trigram analyzer. This is probably good
            # enough for accelerating regexes; we probably don't
            # need to keep a separate case-sensitive index.
            "trigramalyzer": {
                "filter": ["lowercase"],
                "tokenizer": "trigram_tokenizer"
            }
        },
        "tokenizer": {
            "trigram_tokenizer": {
                "type": "nGram",
                "min_gram": 3,
                "max_gram": 3
            }
        }
    }
},
"mappings": {
    'line': {  # One line of source code
        "_all": {
            "enabled": False
        },
        "properties": {
            "content": {
                "type": "string",
                "index": "not_analyzed"
            },
            "content_trg": {
                "type": "string",
                "analyzer": "trigramalyzer"
            }
        }
    }
}

Many thanks for any ideas!

Erik Rose


Martijn took a swing at it just now. He eliminated any scoring-based
slowdown, like so (constant_score_filter)…

curl -s -XGET 'http://127.0.0.1:9200/dxr_test/line/_search?pretty' -d '{
    "query": {
        "filtered": {
            "query": {
                "match_all": {}
            },
            "filter": {
                "and": [
                    {
                        "query": {
                            "match_phrase": {
                                "content_trg": "Children"
                            }
                        }
                    },
                    {
                        "query": {
                            "match_phrase": {
                                "content_trg": "Next"
                            }
                        }
                    },
                    {
                        "query": {
                            "wildcard": {
                                "content": {
                                    "wildcard": "*Children*Next*",
                                    "rewrite": "constant_score_filter"
                                }
                            }
                        }
                    }
                ]
            }
        }
    }
}'

…but it didn't make any difference. Somehow, the and pipeline isn't
behaving as we expect. Since ES can't provide any more detailed timing
output, I guess the next step is to go look at the source code for the and
filter and the wildcard query and see what's what.

I think we'd both be fascinated to know what's going on, if anyone has
anything to add.


On Wed, May 21, 2014 at 6:01 PM, Erik Rose grincheroo@gmail.com wrote:

I'm trying to move Mozilla's source code search engine (dxr.mozilla.org)
from a custom-written SQLite trigram index to ES. In the current production
incarnation, we support fast regex (and, by extension, wildcard) searches by
extracting trigrams from the search pattern and paring down the documents to
those containing said trigrams.

This is definitely a great approach for a database, but it won't work
exactly the same way for an inverted index because the data structure
is totally different.

In an inverted index, queries like wildcards are slow: they must iterate
over and match all terms in the document collection, then intersect those
postings with the rest of your query. Because it's inverted, it works
backwards from what you expect, and that's why adding additional
intersections like 'AND' doesn't speed anything up: they haven't happened
yet.
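
To make that concrete, here is a rough, very simplified picture of the
work a wildcard query does - illustrative Python, not Lucene's actual
implementation:

import fnmatch

def wildcard_docs(term_dict, postings, pattern):
    # The expensive part: walk every term in the field's term dictionary.
    matching = [t for t in term_dict if fnmatch.fnmatch(t, pattern)]
    # Only now are there postings to union; other filters you add
    # intersect with this result afterwards, so they cannot shortcut
    # the term-dictionary walk above.
    docs = set()
    for term in matching:
        docs.update(postings[term])
    return docs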

N-grams can speed up partial matching in general, but the methods to
accomplish this are different: usually the best way to go about it is to
think about analyzing the data in such a way that the queries you need
are as basic as possible.

The first question is whether you really need partial matching at all: I
don't have much knowledge about your use case, but just going from your
example, I would look at wildcards like "*Children*Next*" and ask whether
I'd instead want to ensure my analyzer splits on case changes, and see if
I could get what I need with a sloppy phrase query.
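
Something like the following is the shape I mean - an untested sketch in
the same style as your mapping, with made-up analyzer, filter, and field
names. The word_delimiter filter splits identifiers like "ChildrenNext"
into "Children" and "Next", and a sloppy phrase query then finds the two
words near each other:

split_on_case = {
    "analysis": {
        "analyzer": {
            "code_words": {
                "tokenizer": "whitespace",
                "filter": ["camel_split", "lowercase"]
            }
        },
        "filter": {
            "camel_split": {
                "type": "word_delimiter",
                "split_on_case_change": True,
                "preserve_original": True
            }
        }
    }
}

sloppy_phrase = {
    "match_phrase": {
        "content_words": {"query": "Children Next", "slop": 2}
    }
}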


Leading wildcards are really expensive. Maybe you can try creating a copy
of your "content" field that reverses the tokens using reverse token filter
[1]. By doing this you turn those expensive leading wildcards into
trailing wildcards which should give you better performance. I think your
query would look something like this:

{
    "query": {
        "constant_score": {
            "query": {
                "bool": {
                    "should": [
                        {"wildcard": {"content": "Children*Next*"}},
                        {"wildcard": {"content_rev": "txeN*nerdlihC*"}}
                    ]
                }
            }
        }
    }
}

Note that you will need to reverse your query string as the wildcard query
is not analyzed.
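
To be concrete, the reversed copy could be set up something like this -
an untested sketch with made-up names, in the same style as your mapping:

reversed_copy = {
    "settings": {
        "analysis": {
            "analyzer": {
                "reversalyzer": {
                    "tokenizer": "keyword",  # keep the whole line as one token
                    "filter": ["reverse"]    # ...then flip it end to end
                }
            }
        }
    },
    "mappings": {
        "line": {
            "properties": {
                "content_rev": {"type": "string", "analyzer": "reversalyzer"}
            }
        }
    }
}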

[1]

Thanks,
Matt Weber


Leading wildcards are really expensive. Maybe you can try creating a copy
of your "content" field that reverses the tokens using reverse token filter
[1].

Good advice, typically, but notice I have wildcards on either side.
Reversing just makes the trailing wildcard expensive. :)


Aye, and then you can use edit distance on single words (fuzzy query) to
cope with fast typers.

This is definitely a great approach for a database, but it won't work
exactly the same way for an inverted index because the datastructure
is totally different.

Ah, I was afraid of that. I had hoped, since the field is unanalyzed (and
the documentation notes that wildcard queries work only on unanalyzed
fields), that the wildcard query would apply to the fetched field contents
rather than crawling the whole index. But your theory certainly fits the
timings I've been getting. Thanks!


Alright, try this on for size. :)

Since the built-in regex-ish filters want to be all clever and index-based,
why not use the JS script plugin, which is happy to run as a
post-processing phase?

curl -s -XGET 'http://127.0.0.1:9200/dxr_test/line/_search?pretty' -d '{
    "query": {
        "filtered": {
            "query": {
                "match_all": {}
            },
            "filter": {
                "and": [
                    {
                        "query": {
                            "match_phrase": {
                                "content_trg": "Children"
                            }
                        }
                    },
                    {
                        "query": {
                            "match_phrase": {
                                "content_trg": "Next"
                            }
                        }
                    },
                    {
                        "script": {
                            "lang": "js",
                            "script": "(new 

RegExp(pattern)).test(doc["content"].value)",
"params": {
"pattern": "Children.*Next"
}
}
}
]
}
}
}
}'

That gets me through the whole 16M-doc corpus in 117ms. (Without the
match_phrase queries, it takes forever, at 12s, so you can see the trigram
acceleration working.) I am ecstatic.

Some of you might note that the pattern doesn't begin or end with a
wildcard; that's because RegExp.test() serves as a search rather than a
match, so wildcards are effectively assumed.
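
The wildcard-to-regex translation is trivial; a hypothetical helper (not
what's in DXR) would be something like:

import re

def wildcard_to_regex(pattern):
    """Turn a user wildcard like "*Children*Next*" into the unanchored
    regex handed to the script's "pattern" param. Since RegExp.test()
    searches, the leading and trailing ".*" can simply be dropped."""
    parts = [re.escape(part) for part in pattern.split("*") if part]
    return ".*".join(parts)

print(wildcard_to_regex("*Children*Next*"))  # Children.*Next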

Cheers!



Sorry to revive a super old thread, but it's pretty relevant. I (sometimes)
need the super duper exact matching that you can get out of running regular
expressions against the source, similarly to the proposal above. I
currently have it implemented as a post filter as well. The problem is
that it's too easy to put together a query that has to visit millions of
documents, and even if you always write fast queries you can get in line
behind users with slow queries. So the feature is pretty broken as is. So
I took a stab at writing a trigram-accelerated regex plugin similar to
pg_trgm but, well, for Elasticsearch. It's certainly not the most
efficient thing in the world, but it's way, way more efficient (in the
worst case) than post filtering.

Right now it's just a proposal that lives here:
https://gerrit.wikimedia.org/r/#/c/163268 . My computer science is pretty
rusty, so I'm sure I've made some mistakes with the automaton and logic
work, but it does work - even for pretty complex regexes - and that is
certainly an area for future improvement. Another area is executing only
some of the filters: right now all the logic from the regex is extracted
into a tree of bool filters and they are all executed. You could save time
by executing the term filters iteratively until only some manageable
number of documents remain to recheck, and then just recheck them.
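
Roughly the kind of thing I mean, in illustrative Python rather than the
plugin's actual Java, and pretending the filters are a flat AND of
trigrams instead of a full bool tree:

def candidate_docs(trigram_terms, postings, recheck_budget=10000):
    # Apply the rarest (cheapest) trigram filters first and stop
    # intersecting once the candidate set is small enough that it is
    # cheaper to just recheck the remaining docs with the real regex.
    docs = None
    for trigram in sorted(trigram_terms, key=lambda t: len(postings[t])):
        hits = set(postings[trigram])
        docs = hits if docs is None else docs & hits
        if len(docs) <= recheck_budget:
            break
    return docs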

Anyway, this is a start. It's not meant to replace real language analysis,
but it helps if you need exact matches like those from regular expressions
with a reasonable degree of efficiency.

Please have a look at the code (https://gerrit.wikimedia.org/r/#/c/163268)
and comment on it. At some point I'll release it as an Elasticsearch
plugin.

Thanks for reading,

Nik
