Alternative to ngram tokenizer and token filter. Substring issue

lantern77 · June 20, 2016, 6:14pm

Hey guys,

Currently I am looking for a way to search for a substring inside a phrase or text. I know that the two most common methods are:

1. The ngram tokenizer / token filter.
2. Wildcard search.

However, both involve a trade-off. (1) The ngram tokenizer carries too much data overhead, which is something we can't afford (note that we would like ngrams over a large range of lengths for better accuracy). (2) Wildcard search is unfortunately too slow, especially for substrings that appear often in searches and for searches that begin with the * itself.
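To make the overhead concern concrete, here is a minimal sketch in plain Python (not Elasticsearch code) that counts how many character n-grams a single term produces for a given length range. The function name and the `min_gram`/`max_gram` parameters are illustrative assumptions that only mirror the idea of an n-gram range; real index size also depends on how the engine stores terms and postings.

```python
def ngrams(term, min_gram, max_gram):
    """Return every character n-gram of `term` with length in [min_gram, max_gram]."""
    return [
        term[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(term) - n + 1)  # empty range when n > len(term)
    ]


if __name__ == "__main__":
    for term in ("banana", "elasticsearch"):
        tokens = ngrams(term, 1, len(term))
        # A term of length n emits n * (n + 1) / 2 tokens over the full range.
        print(term, len(tokens))
```

The token count grows quadratically with term length when the range is wide, which is why a large n-gram range gets expensive quickly.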

I was wondering if there is a one-size-fits-all, jack-of-all-trades solution here: one that does not use too much memory yet remains fast and accurate.

Regards, Peter.
P.S. I have thought about using the reverse token filter to generate two fields and issuing two term queries, to cover strings that start with and end with the search text. However, that solution does not cover substrings in the middle of the term we want.

softwaredoug · June 20, 2016, 7:16pm

Are you trying to search for arbitrary substrings in terms? What sorts of sample queries would you have?

Would you expect a query "an" to match "banana"? Or are you looking for some kind of typo tolerance, such as single mistakes?

lantern77 · June 20, 2016, 7:36pm

Yes, I am trying to search for arbitrary substrings. So "an" should match "banana".

nik9000 · June 20, 2016, 7:52pm

Reverse edgeNGram on the analyze side and prefix queries on the search side offer a decent trade-off, I think. I expect the space usage to be much, much, much better than ngrams and the performance to be significantly better than wildcard search.
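One way to read this suggestion: edge n-grams taken from the back of a term are its suffixes, and any substring of a term is a prefix of one of those suffixes. The sketch below illustrates that idea in plain Python with a sorted list and `bisect`; it is not Elasticsearch code, and `SuffixIndex` and its methods are hypothetical names invented for the illustration. It also assumes no maximum gram length, so every suffix is kept.

```python
import bisect


def suffixes(term):
    """All suffixes of `term`, i.e. edge n-grams taken from the back."""
    return [term[i:] for i in range(len(term))]


class SuffixIndex:
    """Toy index: sorted (suffix, term) pairs searched by prefix."""

    def __init__(self):
        self._entries = []  # kept sorted so prefix matches are contiguous

    def add(self, term):
        for suffix in suffixes(term):
            bisect.insort(self._entries, (suffix, term))

    def search(self, substring):
        """Return the terms containing `substring`, via a prefix scan over suffixes."""
        # (substring,) sorts before any (substring..., term) pair.
        start = bisect.bisect_left(self._entries, (substring,))
        hits = set()
        for suffix, term in self._entries[start:]:
            if not suffix.startswith(substring):
                break  # past the contiguous run of matching suffixes
            hits.add(term)
        return hits


if __name__ == "__main__":
    index = SuffixIndex()
    for word in ("banana", "bandana", "cherry"):
        index.add(word)
    print(sorted(index.search("an")))
```

A term of length n contributes n suffixes here, compared with n(n+1)/2 tokens for a full-range n-gram, which is the space saving the post anticipates. With a capped maximum gram length, substrings that start more than that many characters from the end of a term would not be found.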

nik9000 · June 20, 2016, 7:53pm

I'm just shooting from the hip here. I haven't tested it. But I expect it'll work.

lantern77 · June 20, 2016, 7:54pm

Thanks, I'll definitely try it out. I'm a bit worried about accuracy, but that's something I'm willing to trade off.
