Hi,
I have a string field that is analyzed with an NGram analyzer.
Let's say the field is named 'id' and its values match [a-z]+.
I use the NGram analyzer because I want to be able to search for any substring of an id.
Here's an example:
curl -XPUT 'http://localhost:9200/test/' -d \
'{"index":{"analysis":{
  "analyzer":{"ngram":{"type":"custom","tokenizer":"myngram","filter":["lowercase"]}},
  "tokenizer":{"myngram":{"type":"ngram","min_gram":1,"max_gram":3}}}}}'
curl -XPUT 'http://localhost:9200/test/type1/_mapping' -d \
'{"type1":{"properties":{"id":{"type":"string","index":"analyzed","analyzer":"ngram"}}}}'
curl -XPUT localhost:9200/test/type1/doc1 -d '{"id":"aaaaaabbbcc"}'
curl -XPUT localhost:9200/test/type1/doc2 -d '{"id":"aaaabbbbbcc"}'
curl -XPUT localhost:9200/test/type1/doc3 -d '{"id":"aaaaaaaabcc"}'
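(If you search right after indexing, the docs may not be visible yet; a refresh makes them searchable:)
curl -XPOST 'localhost:9200/test/_refresh'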
I search against these docs with a constant_score query wrapping a text query.
I use constant_score because I don't need scoring, and I believe it speeds up the search.
curl -XGET localhost:9200/test/type1/_search -d \
'{"query":{"constant_score":{"query":{"text":{"id":{"query":"bbb","operator":"and"}}}}}}'
This search works fine and returns doc1 and doc2.
However, if I search for a run of the same letter that is longer than max_gram,
I get false hits.
For example (4 'b's),
curl -XGET localhost:9200/test/type1/_search -d \
'{"query":{"constant_score":{"query":{"text":{"id":{"query":"bbbb","operator":"and"}}}}}}'
returns doc1 (3 'b's, a false hit) as well as doc2 (5 'b's, a true hit).
This is understandable given how the NGram analyzer works: the query string 'bbbb' is itself
broken into grams no longer than max_gram (b, bb, bbb, ...), and all of those grams also occur in doc1.
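(You can verify this by running the query string through the same analyzer and noting that nothing longer than 'bbb' comes out:)
curl -XGET 'localhost:9200/test/_analyze?analyzer=ngram' -d 'bbbb'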
So I came up with the idea of applying a filter to weed out the false hits.
I changed the mapping so that the id field is a multi_field with "id" (NGram'ed) and "id.raw" (not analyzed).
curl -XDELETE 'http://localhost:9200/test'
curl -XPUT 'http://localhost:9200/test/' -d \
'{"index":{"analysis":{
  "analyzer":{"ngram":{"type":"custom","tokenizer":"myngram","filter":["lowercase"]}},
  "tokenizer":{"myngram":{"type":"ngram","min_gram":1,"max_gram":3}}}}}'
curl -XPUT 'http://localhost:9200/test/type1/_mapping' -d \
'{"type1":{"properties":{
  "id":{"type":"multi_field","fields":{
    "id":{"type":"string","index":"analyzed","analyzer":"ngram"},
    "raw":{"type":"string","index":"not_analyzed"}}}}}}'
curl -XPUT localhost:9200/test/type1/doc1 -d '{"id":"aaaaaabbbcc"}'
curl -XPUT localhost:9200/test/type1/doc2 -d '{"id":"aaaabbbbbcc"}'
curl -XPUT localhost:9200/test/type1/doc3 -d '{"id":"aaaaaaaabcc"}'
Then I can get the desired hits by adding a wildcard query on the raw field (with asterisks so it matches a substring anywhere in the value):
curl -XGET localhost:9200/test/type1/_search -d \
'{"query":{"constant_score":{"filter":{"and":[
  {"query":{"text":{"id":{"query":"bbbb","operator":"and"}}}},
  {"query":{"wildcard":{"id.raw":"*bbbb*"}}}
]}}}}'
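(As an aside, I think the same intent can also be expressed as a filtered query, with the ngram text query as the query part and the wildcard wrapped in a query filter; I'm not sure whether it executes any differently:)
curl -XGET localhost:9200/test/type1/_search -d \
'{"query":{"filtered":{
  "query":{"text":{"id":{"query":"bbbb","operator":"and"}}},
  "filter":{"query":{"wildcard":{"id.raw":"*bbbb*"}}}}}}'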
So far so good. But questions came up:
Question 1)
I used the NGram analyzer for efficient substring search,
but I ended up using a wildcard filter to get rid of the undesired hits.
Does this approach defeat my original intention, which was fast search?
Question 2)
The final query seems redundant, because the wildcard filter alone could do the job.
Does keeping the first text query help performance?
In other words, is the wildcard filter applied only to the results of the first text query?
Question 3)
Is there a better way to do this kind of substring search?
Thank you for your help.