I'm trying to move Mozilla's source code search engine (dxr.mozilla.org)
from a custom-written SQLite trigram index to ES. In the current production
incarnation, we support fast regex (and, by extension, wildcard) searches
by extracting trigrams from the search pattern and paring down the
documents to those containing said trigrams. That gives us a manageable
number of docs to then run the actual regex against. It performs very well.
For example, to match the regex .*[cr]attle, we filter the corpus down to
docs containing (("rat" OR "cat") AND "att" AND "ttl" AND "tle") and then
run the full regex against those candidates.
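For illustration, the filter-then-verify idea looks roughly like this in Python. This is a minimal sketch over a toy in-memory index, not DXR's actual SQLite implementation: the helper names (trigrams, build_index, search) and the sample lines are invented for the example, and the trigram sets are supplied by hand rather than derived from the regex automatically.

```python
import re
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs):
    """Map each trigram to the set of ids of the docs that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def search(docs, index, pattern, any_of, all_of):
    """Pare docs down by trigram, then run the real regex on the survivors."""
    # (any_of[0] OR any_of[1] OR ...)
    ids = set().union(*(index.get(gram, set()) for gram in any_of))
    # ... AND every trigram in all_of
    for gram in all_of:
        ids &= index.get(gram, set())
    regex = re.compile(pattern)
    return [docs[i] for i in sorted(ids) if regex.search(docs[i])]


# Hypothetical sample corpus: one "document" per line of source.
docs = ["the cattle grazed", "a baby rattle", "battle plans", "seattle"]
index = build_index(docs)
hits = search(docs, index, r".*[cr]attle",
              any_of=["rat", "cat"], all_of=["att", "ttl", "tle"])
print(hits)
```

Here "battle plans" and "seattle" contain "att", "ttl", and "tle" but neither "rat" nor "cat", so they are discarded before the regex ever runs.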
It seems like ES should be able to do this handily, but adding the trigram
filters doesn't make it any faster. First, here's the naive approach, using
an unaccelerated wildcard query. This should have to scan the world:
# Brute-force wildcard search
curl -s -XGET 'http://127.0.0.1:9200/dxr_test/line/_search?pretty' -d '{
    "query": {
        "constant_score": {
            "query": {
                "wildcard": {"content": "*Children*Next*"}
            }
        }
    }
}'
Then, we add trigram filters in an attempt to accelerate matters. The
filters alone take only 80ms to run and return 100 docs, each about 100
chars long, so I would expect running a wildcard query over those to be
scarcely noticeable. However, this query still takes 500ms, the same as the
above:
curl -s -XGET 'http://127.0.0.1:9200/dxr_test/line/_search?pretty' -d '{
    "query": {
        "filtered": {
            "query": {
                "constant_score": {
                    "query": {
                        "wildcard": {"content": "*Children*Next*"}
                    }
                }
            },
            "filter": {
                "and": [
                    {
                        "query": {
                            "match_phrase": {
                                "content_trg": "Children"
                            }
                        }
                    },
                    {
                        "query": {
                            "match_phrase": {
                                "content_trg": "Next"
                            }
                        }
                    }
                ]
            }
        }
    }
}'
(By using match_phrase, I'm counting on the query analyzer to break up
"Children" and "Next" into trigrams and bang them against the trigram
index, using the stored term positions to further pare down the found docs.
Judging by speed, it appears to work very well.)
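To make the shape of that request concrete, here is a rough Python sketch that builds the same body from a wildcard pattern. The function name trigram_filtered_query is hypothetical, and it assumes that every literal run of three or more characters between wildcards can be handed to match_phrase as-is; shorter runs are skipped because they produce no trigram.

```python
import json
import re


def trigram_filtered_query(wildcard, raw_field="content",
                           trigram_field="content_trg"):
    """Build a filtered-query body: trigram phrase filters plus the wildcard."""
    wildcard_query = {
        "constant_score": {"query": {"wildcard": {raw_field: wildcard}}}
    }
    # Literal runs between the * and ? wildcards must all appear in a match.
    literals = [part for part in re.split(r"[*?]", wildcard) if len(part) >= 3]
    if not literals:
        # Nothing to accelerate with; fall back to the brute-force query.
        return {"query": wildcard_query}
    filters = [{"query": {"match_phrase": {trigram_field: literal}}}
               for literal in literals]
    return {
        "query": {
            "filtered": {
                "query": wildcard_query,
                "filter": {"and": filters},
            }
        }
    }


body = trigram_filtered_query("*Children*Next*")
print(json.dumps(body, indent=4))
```

For "*Children*Next*" this yields one match_phrase filter for "Children" and one for "Next", matching the hand-written query above.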
Then I thought, "Perhaps ES needs a hint to run the wildcard query last."
But no, using post_filter is just as slow:
curl -s -XGET 'http://127.0.0.1:9200/dxr_test/line/_search?pretty' -d '{
    "query": {
        "filtered": {
            "query": {
                "match_all": {}
            },
            "filter": {
                "and": [
                    {
                        "query": {
                            "match_phrase": {
                                "content_trg": "Children"
                            }
                        }
                    },
                    {
                        "query": {
                            "match_phrase": {
                                "content_trg": "Next"
                            }
                        }
                    }
                ]
            }
        }
    },
    "post_filter": {
        "query": {
            "wildcard": {"content": "*Children*Next*"}
        }
    }
}'
Is it possible to coax ES into doing trigram-accelerated wildcard or regex
searching? What am I missing?
Here is the relevant part of my mapping:
"settings": {
    "analysis": {
        "analyzer": {
            # A lowercase trigram analyzer. This is probably good
            # enough for accelerating regexes; we probably don't
            # need to keep a separate case-sensitive index.
            "trigramalyzer": {
                "filter": ["lowercase"],
                "tokenizer": "trigram_tokenizer"
            }
        },
        "tokenizer": {
            "trigram_tokenizer": {
                "type": "nGram",
                "min_gram": 3,
                "max_gram": 3
            }
        }
    }
},
"mappings": {
    'line': {  # One line of source code
        "_all": {
            "enabled": False
        },
        "properties": {
            "content": {
                "type": "string",
                "index": "not_analyzed"
            },
            "content_trg": {
                "type": "string",
                "analyzer": "trigramalyzer"
            }
        }
    }
}
Many thanks for any ideas!
Erik Rose