I have a custom analyzer with a char replace filter, a keyword tokenizer, and a lowercase filter. The problem I want to solve is as follows:
Original string in index : 1812abcdefg
user input query : 812abcdefg
the index has other strings such as 812c, 812ab etc.
I want to be able to allow one or two characters preceding the edge ngram. I have tried ngram, but the strings can be of varying length, from 1 to 50 characters, and that throws off the search.
So you essentially need a suffix search? E.g. *812abcdefg?
The best way to do that is to add another analysis chain that includes a reverse filter, then use a prefix query against it.
So the reverse filter will index 1812abcdefg as gfedcba2181. When you use a prefix query on that same analysis chain, 812abcdefg will be converted into gfedcba218* and you'll get your match without an expensive suffix wildcard.
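To make the reversal trick concrete, here is a minimal pure-Python sketch that mimics the idea outside Elasticsearch. The function names (analyze_reversed, suffix_search) are hypothetical and only stand in for a keyword tokenizer + lowercase + reverse chain and a prefix query; this is not the actual Elasticsearch API.

```python
def analyze_reversed(text: str) -> str:
    # Stand-in for keyword tokenizer + lowercase + reverse filter:
    # the whole input is one token, lowercased, then reversed.
    return text.lower()[::-1]


def suffix_search(query: str, index: dict) -> list:
    # Stand-in for a prefix query run through the same analysis chain:
    # a suffix of the original string is a prefix of the reversed term.
    prefix = analyze_reversed(query)
    return [doc for doc, term in index.items() if term.startswith(prefix)]


docs = ["1812abcdefg", "812c", "812ab"]
index = {doc: analyze_reversed(doc) for doc in docs}

matches = suffix_search("812abcdefg", index)
print(index["1812abcdefg"])  # gfedcba2181
print(matches)               # ['1812abcdefg']
```

The shorter terms 812c and 812ab do not match because their reversed forms (c218, ba218) do not start with gfedcba218.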
Is there a way to perform a substring search? While the reverse filter mentioned above is definitely a good idea, sometimes the original string in the index can look like
1812abcdefg-pby-reel, 1812abcdefg123, etc. In this scenario the reverse filter may not work. I do use a char replace filter to remove non-word characters.
You can also use ngrams / shingles with the reverse approach. So then you'll be indexing reversed fragments, which will match the prefix query. Basically the exact same analyzer you have now, except add a reverse filter to it.
Also, most people who implement this strategy also index the forward direction too, so that they get both prefix and suffix search.
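As a rough illustration of combining ngrams with the reverse approach, the sketch below simulates it in plain Python. Everything here is an assumption made for illustration: the helper names, the regex used in place of the char replace filter, and the gram sizes 1 to 50 (taken from the string lengths mentioned earlier in the thread). It only shows which documents match, not how Elasticsearch would score them.

```python
import re


def normalize(text: str) -> str:
    # Stand-in for the char replace filter + lowercase filter:
    # lowercase, then drop anything that is not a letter or digit.
    return re.sub(r"[^0-9a-z]", "", text.lower())


def ngrams(token: str, min_gram: int = 1, max_gram: int = 50) -> set:
    # All contiguous fragments of the token with min_gram..max_gram characters.
    return {
        token[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(token) - n + 1)
    }


def index_reversed_fragments(text: str) -> set:
    # Reverse the normalized token, then split it into fragments.
    return ngrams(normalize(text)[::-1])


def reversed_fragment_match(query: str, fragments: set) -> bool:
    # Prefix query: the reversed query must start at least one indexed fragment.
    prefix = normalize(query)[::-1]
    return any(fragment.startswith(prefix) for fragment in fragments)


docs = ["1812abcdefg-pby-reel", "1812abcdefg123", "812c", "812ab"]
index = {doc: index_reversed_fragments(doc) for doc in docs}
hits = [doc for doc in docs if reversed_fragment_match("812abcdefg", index[doc])]
print(hits)  # ['1812abcdefg-pby-reel', '1812abcdefg123']
```

Indexing the forward direction as well would follow the same pattern without the [::-1] reversal.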
Ngrams in reverse are likely to score other tokens higher. Shingles are not an option, since the terms are like part numbers and contain no stop words.
I hope there is some way to search substrings without using regex or wildcard queries. Also, I am not able to embed a regex or wildcard query in a bool query, in the sense that it has no effect.