Better idea for efficient substring queries?

Hi,
I have a string field that is analyzed with an NGram analyzer.
Let's say the name of the field is 'id' and its values match [a-z]+.
I use the NGram analyzer because I want to be able to search for any substring of an id.
Here's an example:
curl -XPUT 'http://localhost:9200/test/' -d
'{"index":{"analysis":{
"analyzer":{"ngram":{"type":"custom","tokenizer":"myngram","filter":["lowercase"]}},
"tokenizer":{"myngram":{"type":"ngram","min_gram":1,"max_gram":3}}}}}'
curl -XPUT 'http://localhost:9200/test/type1/_mapping' -d
'{"type1":{"properties":{"id":{"type":"string","index":"analyzed","analyzer":"ngram"}}}}'
curl -XPUT localhost:9200/test/type1/doc1 -d '{"id":"aaaaaabbbcc"}'
curl -XPUT localhost:9200/test/type1/doc2 -d '{"id":"aaaabbbbbcc"}'
curl -XPUT localhost:9200/test/type1/doc3 -d '{"id":"aaaaaaaabcc"}'

I use constant_score with a text query to search against these docs.
I use constant_score because I don't need scoring and I think it speeds up the search.
curl -XGET localhost:9200/test/type1/_search -d
'{"query":{"constant_score":{"query":{"text":{"id":{"query":"bbb","operator":"and"}}}}}}'

This search works fine and gets doc1 and doc2.
However, if I search for a run of the same letter that is longer than max_gram,
I get false hits.
For example (4 'b's),
curl -XGET localhost:9200/test/type1/_search -d
'{"query":{"constant_score":{"query":{"text":{"id":{"query":"bbbb","operator":"and"}}}}}}'
returns doc1 (3 'b's, a false hit) and doc2 (5 'b's, a correct hit).
This is understandable given the nature of the NGram analyzer.
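You can see why by running the search string through the ngram analyzer (the _analyze call below is only a sketch; the exact parameters may differ between versions):
curl -XGET 'http://localhost:9200/test/_analyze?analyzer=ngram' -d 'bbbb'
# with min_gram=1 and max_gram=3 this should return the overlapping grams
# b, bb, bbb - and every one of them also occurs in doc1's "aaaaaabbbcc",
# so the "and" text query for "bbbb" still matches doc1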

So I came up with the idea of applying an additional filter to remove the false hits.

I changed the mapping so that the id field is a multi_field and has "id" (NGram'ed) and "id.raw" (not analyzed).
curl -XDELETE 'http://localhost:9200/test'
curl -XPUT 'http://localhost:9200/test/' -d
'{"index":{"analysis":{
"analyzer":{"ngram":{"type":"custom", "tokenizer":"myngram","filter":["lowercase"]}},
"tokenizer":{"myngram":{"type":"ngram","min_gram":1,"max_gram":3}}}}}'
curl -XPUT 'http://localhost:9200/test/type1/_mapping' -d
'{"type1":{"properties":{
"id":{"type":"multi_field","fields":{
"id":{"type":"string","index":"analyzed","analyzer":"ngram"},
"raw":{"type":"string","index":"not_analyzed"}}}}}}'
curl -XPUT localhost:9200/test/type1/doc1 -d '{"id":"aaaaaabbbcc"}'
curl -XPUT localhost:9200/test/type1/doc2 -d '{"id":"aaaabbbbbcc"}'
curl -XPUT localhost:9200/test/type1/doc3 -d '{"id":"aaaaaaaabcc"}'

Then I can get the desired hits by also specifying a wildcard query on the raw field:
curl -XGET localhost:9200/test/type1/_search -d
'{"query":{"constant_score":{"filter":{"and":[
{"query":{"text":{"id":{"query":"bbbb","operator":"and"}}}},
{"query":{"wildcard":{"id.raw":"bbbb"}}}
]}}}}'

So far so good. But questions came up:

Question 1)
I used the NGram analyzer for efficient substring search.
However, I ended up using a wildcard filter to get rid of the undesired hits.
Does this approach defeat my original intention, which is fast search?

Question 2)
The final query seems redundant, because the wildcard filter alone can do the job.
Does having the first text query contribute to performance?
In other words, is the wildcard query filter applied only to the results of the first text query?

Question 3)
Is there a better way to get substring search done?

Thank you for your help.

I did some experiments.
I compared the search times of the following queries.
A)
curl -XGET localhost:9200/test/type1/_search -d \
'{"query":{"constant_score":{
"query":{"wildcard":{"id.raw":"bbbb"}}
}}}'
B)
curl -XGET localhost:9200/test/type1/_search -d \
'{"query":{"constant_score":{"filter":{"and":[
{"query":{"text":{"id":{"query":"bbbb","operator":"and"}}}},
{"query":{"wildcard":{"id.raw":"bbbb"}}}
]}}}}'
C)
curl -XGET localhost:9200/test/type1/_search -d \
'{"query":{"filtered":{
"query":{"constant_score":{"query":{"text":{"id":{"query":"bbbb","operator":"and"}}}}},

"filter":{"query":{"wildcard":{"id.raw":"bbbb"}}}
}}}'

(Recap: "id" is NGram'ed and "id.raw" is not_analyzed)

I expected C to be the fastest, but I observed that A was the fastest.
That was probably because I only have a couple of thousand documents or so.
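(By the way, one simple way to compare them is to look at the "took" field, in milliseconds, at the top of each search response, for example:)
curl -XGET 'localhost:9200/test/type1/_search?pretty' -d
'{"query":{"constant_score":{"query":{"wildcard":{"id.raw":"bbbb"}}}}}'
# the response begins with something like: {"took": 3, "timed_out": false, ...}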

In theory, with a large number of documents, like millions, can I expect C
to be the fastest?
Thanks for your help.

Hi arta

On Mon, 2012-05-28 at 10:59 -0700, arta wrote:

I did some experiments.
I compared the search times of the following queries.
A)
curl -XGET localhost:9200/test/type1/_search -d
'{"query":{"constant_score":{
"query":{"wildcard":{"id.raw":"bbbb"}}
}}}'
B)
curl -XGET localhost:9200/test/type1/_search -d
'{"query":{"constant_score":{"filter":{"and":[
{"query":{"text":{"id":{"query":"bbbb","operator":"and"}}}},
{"query":{"wildcard":{"id.raw":"bbbb"}}}
]}}}}'
C)
curl -XGET localhost:9200/test/type1/_search -d
'{"query":{"filtered":{
"query":{"constant_score":{"query":{"text":{"id":{"query":"bbbb","operator":"and"}}}}},
"filter":{"query":{"wildcard":{"id.raw":"bbbb"}}}
}}}'

(Recap: "id" is NGram'ed and "id.raw" is not_analyzed)

If 'id' is ngram'ed, then you don't need to use wildcard queries on it.
Just a standard text query will suffice.

Have a look at this post:

It goes into a fair bit of detail about how to use ngrams.

clint

Thank you clint,
The post you mentioned is very helpful.

However, the situation I have here is a bit different.
In your post, you assume "Your users will probably search for words, numbers or dates, but they probably won't expect 'ile' to match 'file'."
In my case, I need any-substring search capability.

Another reason I came up with the wildcard-filter idea is that I will eventually have a huge number of documents, so I did not want to set max_gram to a big number. I believe increasing it increases the index size substantially. I would like to keep max_gram at 2 or 3 if that gets the job done. The length of 'id' is unbounded (practically it will be less than a couple of hundred characters), which is another reason I cannot set a big enough max_gram.

Thank you for your further help.

Hi Arta

However, the situation I have here is a bit different.
In your post, you assume "Your users will probably search for words, numbers
or dates, but they probably won't expect 'ile' to match 'file'."
In my case, I need any-substring search capability.

Then use ngrams instead of edge-ngrams.

Another reason I came up with the wildcard-filter idea is that I will
eventually have a huge number of documents, so I did not want to set
max_gram to a big number. I believe increasing it increases the index
size substantially. I would like to keep max_gram at 2 or 3 if that
gets the job done. The length of 'id' is unbounded (practically it
will be less than a couple of hundred characters), which is another
reason I cannot set a big enough max_gram.

Wildcards are almost never the answer. They are inefficient and heavy.
ES has to load all your terms into memory, find which ones match the
pattern of your wildcard on the fly, then run the query.

Ngrams do this work (or similar) once at index time. They are much more
efficient and much faster. But yes, they do take up more space. But
frankly, space is seldom an issue in comparison to memory and CPU. And
they don't take up as much space as you think. It's not that it is
storing 'f', 'fo','foo' for every document. It just adds the document
ID to the document list for each relevant term.

As a worked example, take the two docs 'foobar' and 'boobaz', with a
min-gram of 2 and a max-gram of 4.

'foobar' gives you:

  • fo, oo, ob, ba, ar
  • foo, oob, oba, bar
  • foob, ooba, obar

'boobaz' gives you:

  • bo, oo, ob, ba, az
  • boo, oob, oba, baz
  • boob, ooba, obaz

Now let's search on 'obaz' - the search term is also broken down into
ngrams. Checking which of those grams each document contains
(X = present, O = absent):

          foobar   boobaz
  ob        X        X
  ba        X        X
  az        O        X
  oba       X        X
  baz       O        X
  obaz      O        X

'boobaz' is clearly the winner - it is more relevant.

Also, if you do a "phrase" search, then it will make sure that the
matching term has the same ngrams in the SAME ORDER.

So you don't really need a max-gram of 500 characters to achieve what
you want.
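For example, something along these lines (just a sketch - check the text/text_phrase query syntax for your version):
curl -XGET localhost:9200/test/type1/_search -d
'{"query":{"text":{"id":{"query":"obaz","type":"phrase"}}}}'
# the ngrams of "obaz" have to appear in the same order, so 'boobaz'
# matches but 'foobar' does not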

Read the section entitled "UPDATE" on the post I linked to for more on
the phrase search with ngrams:

clint

Thank you so much clint,
I will consider having a bigger max_gram and staying away from wildcards.
I'm convinced that's the better direction.

Now I tried min_gram=1 and max_gram=100, recreated the index, and ran the following query:
"text":{"id":{"operator":"and","query":"aaaa <snip. total 100 a's> aaa"}}

I got an error saying:
TooManyClauses[maxClauseCount is set to 1024]

So I googled a bit and added the following line to config/elasticsearch.yml:
index.query.bool.max_clause_count: 10000

After restarting ES, the query succeeds.
My interpretation of this experiment is that ES converts the text query into a bool query with one clause per analyzed token. I think a 100-character string analyzed with 1- to 100-grams produces 5050 tokens (100 + 99 + ... + 1), and therefore 5050 clauses.
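For a short query it is easy to see. For example, searching for 'abc' against the original 1- to 3-gram field would, as I understand it, be roughly equivalent to something like this (just an illustration, not what ES literally generates; 6 = 3 + 2 + 1 clauses, all required because of operator "and"):
curl -XGET localhost:9200/test/type1/_search -d
'{"query":{"bool":{"must":[
{"term":{"id":"a"}},{"term":{"id":"b"}},{"term":{"id":"c"}},
{"term":{"id":"ab"}},{"term":{"id":"bc"}},
{"term":{"id":"abc"}}
]}}}'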

My question here is: what is the impact of increasing max_clause_count?
The default is 1024, and there must be a reason the default value is that small.

Thanks for your help.

Oh, maybe I'm doing the wrong thing.
As Clint mentioned in his reply to another question here:
http://elasticsearch-users.115913.n3.nabble.com/help-needed-with-the-query-tt3177477.html#a3178856
I probably should apply a different analyzer at search time.
Then the max clause count would not matter.
Is this the right way to go?
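For example, something like this is what I have in mind (only a sketch; I have not verified the exact mapping syntax):
curl -XPUT 'http://localhost:9200/test/type1/_mapping' -d
'{"type1":{"properties":{
"id":{"type":"string",
"index_analyzer":"ngram",
"search_analyzer":"keyword"}}}}'
# index with the ngram analyzer, but keep the whole query string as a
# single token at search time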

Thanks for your help.

Hi arta

On Thu, 2012-05-31 at 19:01 -0700, arta wrote:

Oh, maybe I'm doing the wrong thing.
As Clint mentioned in his reply to another question here:
http://elasticsearch-users.115913.n3.nabble.com/help-needed-with-the-query-tt3177477.html#a3178856
I probably should apply a different analyzer at search time.
Then the max clause count would not matter.
Is this the right way to go?

Have a read of this:

especially the part starting with UPDATE

Also, you don't need a max gram of 100 - have a look at this post:
https://groups.google.com/d/msg/elasticsearch/O8iF7vUf3mA/EOqyapWA5VAJ

clint

Thank you very much for the quick response, Clinton!

The reason why I think my max_gram needs to be a big enough number is the following use case:
Let's say min_gram=2 max_gram=4.
document #1 has id='foooobar'
document #2 has id='fooooobar'
Then here's a search for 'ooooo'.
If the id field uses the same NGram analyzer for both index and search, the search hits both of them.
If the id field uses, say, the standard analyzer for search, then the search hits neither of them.
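To spell out my reasoning (grams as I understand the ngram tokenizer with min_gram=2, max_gram=4):

'ooooo' (the query)   -> oo, oo, oo, oo, ooo, ooo, ooo, oooo, oooo
'foooobar' (doc #1)   -> contains oo, ooo and oooo
'fooooobar' (doc #2)  -> contains oo, ooo and oooo

So an "and" text query on the ngram field matches both documents, even though only doc #2 actually contains 'ooooo'.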

Is there a better way (other than setting a big max_gram) to make this search work?
Thanks again for your help. I really appreciate it!

On Fri, 2012-06-01 at 10:33 -0700, arta wrote:

Thank you very much for the quick response, Clinton!

The reason why I think my max_gram needs to be a big enough number is the
following use case:
Let's say min_gram=2 max_gram=4.
document #1 has id='foooobar'
document #2 has id='fooooobar'
Then here's a search for 'ooooo'.
If the id field uses the same NGram analyzer for both index and search,
the search hits both of them.
If the id field uses, say, the standard analyzer for search, then the
search hits neither of them.

Sure, that's a problem. What is the likelihood that you're going to
have such long repeating strings? And how big a problem is it? It depends
on your data, really.

clint

Let me think more deeply about our use cases.
I think I can come up with a reasonable max_gram number.
Thank you again for your help. It really saved me.