Alternative to the ngram tokenizer/token filter for substring search

Hey guys,

Currently I am looking for a means to search for a substring inside a phrase or text. I know that the two most common methods are to use:

  1. The ngram tokenizer / token filter.
  2. Wildcard search.

However, both come with trade-offs. The ngram tokenizer (1) produces too much index overhead, and that's something we can't afford (note that we would like ngrams over a large size range for better accuracy). The wildcard search (2) is unfortunately too slow, especially for substrings that appear often and for searches that begin with the * itself.
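For concreteness, the ngram setup I mean is something like the following (gram sizes are illustrative, and on recent Elasticsearch versions a range this wide also requires raising index.max_ngram_diff):

```json
PUT /ngram-demo
{
  "settings": {
    "index": { "max_ngram_diff": 18 },
    "analysis": {
      "filter": {
        "wide_ngram": { "type": "ngram", "min_gram": 2, "max_gram": 20 }
      },
      "analyzer": {
        "substrings": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "wide_ngram"]
        }
      }
    }
  }
}
```

Even a 6-character word like "banana" emits 15 tokens with grams of 2 to 6, and the token count grows roughly with word length times the width of the gram range, which is exactly the overhead we can't afford.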

I was wondering if there is a one-size-fits-all, or at least a jack-of-all-trades, solution here: something that does not use too much memory yet remains fast and accurate.

Regards, Peter.
P.S. I have thought about using the reverse token filter, generating two fields, and making two term-level queries to cover strings that start with or end with the search text (sketched below). However, that solution did not cover substrings in the middle of the term we want.
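Roughly what I had in mind (untested, names made up; I'm writing the two queries as prefix queries, since that's how the two-field idea works in practice): one field analyzed normally for "starts with", plus a sub-field run through the reverse token filter for "ends with".

```json
PUT /reverse-demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "reversed": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "reverse"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "fields": {
          "reversed": { "type": "text", "analyzer": "reversed" }
        }
      }
    }
  }
}
```

A "starts with ban" search is then a prefix query for "ban" on body, and an "ends with nana" search is a prefix query for the reversed text "anan" on body.reversed:

```json
GET /reverse-demo/_search
{
  "query": {
    "bool": {
      "should": [
        { "prefix": { "body": "ban" } },
        { "prefix": { "body.reversed": "anan" } }
      ]
    }
  }
}
```

This covers both ends of a term, but there is no way to express "contains 'an' somewhere in the middle".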

Are you trying to search for arbitrary substrings in terms? What sorts of sample queries would you have?

Would you expect a query "an" to match "banana"? Or are you looking for some kind of typo tolerance -- single-character mistakes, etc.?

Yes, I am trying to search for arbitrary substrings, so "an" in "banana" should work.

Reverse edgeNGram on the analyze side and prefix queries on the search side offer a decent trade-off, I think. I expect the space usage to be much, much, much better than ngrams and the performance to be significantly better than wildcard search.
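Something along these lines (untested sketch; index and field names are made up). The reverse -> edge_ngram -> reverse filter chain indexes every suffix of each term, so the index grows linearly with term length instead of quadratically like full ngrams:

```json
PUT /substring-demo
{
  "settings": {
    "analysis": {
      "filter": {
        "suffix_edge": { "type": "edge_ngram", "min_gram": 1, "max_gram": 20 }
      },
      "analyzer": {
        "suffixes": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "reverse", "suffix_edge", "reverse"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": { "type": "text", "analyzer": "suffixes" }
    }
  }
}
```

With that analyzer, "banana" is indexed as its suffixes ("a", "na", "ana", "nana", "anana", "banana"), so the search side is a plain prefix query:

```json
GET /substring-demo/_search
{
  "query": {
    "prefix": { "body": "an" }
  }
}
```

"an" is a prefix of the indexed suffixes "ana" and "anana", so the document containing "banana" matches.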


I'm just shooting from the hip here. I haven't tested it. But I expect it'll work.

Thanks, I'll definitely try it out. I'm a bit worried about accuracy, but that's something I'm willing to trade off.