I am currently looking for a way to search for a substring inside a phrase or text. I know that the two most common methods are:
1. The ngram tokenizer / token filter.
2. Wildcard search.
However, both involve a trade-off. (1) The ngram tokenizer adds too much data overhead, which is something we can't afford (note that we would like ngrams over a large range of sizes for better accuracy). (2) Wildcard search is unfortunately too slow, especially for substrings that appear often in searches and for searches that begin with the * itself.
I was wondering whether there is a one-size-fits-all, jack-of-all-trades solution here: one that does not use too much memory yet remains fast and accurate.
Regards Peter.
P.S. I have thought about using the reverse token filter to generate two fields, and making two term queries to cover strings that start with and end with the search text. However, that solution does not cover substrings in the middle of the term we want.
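To make the limitation in the P.S. concrete, here is a minimal pure-Python sketch of the two-field idea. It is not Elasticsearch code: the sorted lists only stand in for the term dictionaries of a normal field and a reversed field, and the helper names (build_fields, prefix_lookup, starts_with, ends_with) are made up for illustration.

```python
from bisect import bisect_left

def build_fields(terms):
    """Return two sorted term lists: the terms as-is and the terms reversed."""
    forward = sorted(set(terms))
    backward = sorted({t[::-1] for t in terms})
    return forward, backward

def prefix_lookup(sorted_terms, prefix):
    """Collect every entry of a sorted list that starts with `prefix`."""
    i = bisect_left(sorted_terms, prefix)
    hits = []
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        hits.append(sorted_terms[i])
        i += 1
    return hits

def starts_with(forward, text):
    """Terms that begin with `text` (prefix match on the normal field)."""
    return prefix_lookup(forward, text)

def ends_with(backward, text):
    """Terms that end with `text` (prefix match on the reversed field)."""
    # Reverse the query, match it as a prefix, then un-reverse the hits.
    return sorted(t[::-1] for t in prefix_lookup(backward, text[::-1]))

# Hypothetical sample terms, chosen only for illustration.
forward, backward = build_fields(["elasticsearch", "searchable", "research"])

print(starts_with(forward, "search"))   # terms beginning with "search"
print(ends_with(backward, "search"))    # terms ending with "search"
print(starts_with(forward, "last"), ends_with(backward, "last"))  # middle of a term
```

The last line shows the gap: "last" occurs inside "elasticsearch", but neither field can find it, because it is neither a prefix nor a suffix of any term.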
Reverse edgeNGram on the analysis side and prefix queries on the search side offer a decent trade-off, I think. I expect the space usage to be much, much better than with ngrams and the performance to be significantly better than wildcard search.
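One way to read this suggestion: edge n-grams taken from the back of a term are its suffixes, and a prefix match against those suffixes finds any substring. The sketch below models that in plain Python, under that assumption. It does not use the Elasticsearch API; the sorted list stands in for the term dictionary, and all function names are hypothetical.

```python
from bisect import bisect_left

def suffixes(term):
    """All suffixes of a term, assumed here to be what a reverse edge n-gram emits."""
    return [term[i:] for i in range(len(term))]

def all_ngrams(term):
    """Every substring of a term, as a full-range n-gram tokenizer would emit."""
    n = len(term)
    return [term[i:j] for i in range(n) for j in range(i + 1, n + 1)]

def build_suffix_index(terms):
    """Sorted (suffix, original term) pairs, standing in for a term dictionary."""
    return sorted({(s, t) for t in terms for s in suffixes(t)})

def substring_search(index, query):
    """Prefix-match the query against the indexed suffixes; return matching terms."""
    # (query,) sorts just before every (suffix, term) pair whose suffix starts with query.
    i = bisect_left(index, (query,))
    found = set()
    while i < len(index) and index[i][0].startswith(query):
        found.add(index[i][1])
        i += 1
    return sorted(found)

# Hypothetical sample terms, chosen only for illustration.
index = build_suffix_index(["elasticsearch", "searchable", "research"])

print(substring_search(index, "last"))    # substring in the middle of a term
print(substring_search(index, "search"))  # substring shared by several terms
print(len(suffixes("elasticsearch")), len(all_ngrams("elasticsearch")))
```

The final line compares the number of tokens each approach produces for a single term: one suffix per character versus every possible substring, which is where the expected space saving over full ngrams would come from. Actual index size and query speed in Elasticsearch depend on the analyzer settings and the data, so this only illustrates the shape of the trade-off.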