Term matching with Elasticsearch edge n-gram

(Venkata Sreekanth Bhagavatula) #1

I have a custom analyzer with a char replace filter, a keyword tokenizer, and a lowercase filter. The problem I want to solve is as follows:

Original string in index : 1812abcdefg

user input query : 812abcdefg

The index also contains other strings such as 812c, 812ab, etc.

I want to allow one or two characters preceding the edge ngram. I have tried ngram, but the strings can be of varying length, from 1 to 50 characters, and that throws off the search.

Does anybody have an idea how to do this?

(Zachary Tong) #2

So you essentially need a suffix search? E.g. *812abcdefg?

The best way to do that is to add another analysis chain, which includes a reverse filter. Then use a prefix query.

So the reverse filter will index 1812abcdefg as gfedcba2181. When you use a prefix query on that same analysis chain, 812abcdefg will be converted into gfedcba218* and you'll get your match without an expensive suffix wildcard.
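A minimal sketch of this idea in Python, kept free of any Elasticsearch client so it runs on its own. The `settings` dict shows how such an analysis chain might be declared; the analyzer name `reverse_keyword`, the char filter name `strip_non_word`, and its pattern are assumptions, not taken from this thread. The `analyze` helper only mimics that chain locally to show why the prefix match succeeds.

```python
import re

# Hypothetical index settings: the poster's chain (char replace, keyword
# tokenizer, lowercase) plus the built-in `reverse` token filter.
# Analyzer and filter names and the regex are assumptions for illustration.
settings = {
    "analysis": {
        "char_filter": {
            "strip_non_word": {
                "type": "pattern_replace",
                "pattern": "[^A-Za-z0-9]",
                "replacement": "",
            }
        },
        "analyzer": {
            "reverse_keyword": {
                "type": "custom",
                "char_filter": ["strip_non_word"],
                "tokenizer": "keyword",
                "filter": ["lowercase", "reverse"],
            }
        },
    }
}


def analyze(text: str) -> str:
    """Local stand-in for the analyzer above: strip, lowercase, reverse."""
    cleaned = re.sub(r"[^A-Za-z0-9]", "", text)
    return cleaned.lower()[::-1]


def suffix_match(indexed: list[str], user_input: str) -> list[str]:
    """Return originals whose reversed token starts with the reversed input."""
    prefix = analyze(user_input)
    return [s for s in indexed if analyze(s).startswith(prefix)]


docs = ["1812abcdefg", "812c", "812ab"]
print(analyze("1812abcdefg"))            # gfedcba2181
print(suffix_match(docs, "812abcdefg"))  # ['1812abcdefg']
```

Depending on the Elasticsearch version and field type, the text of a `prefix` query may not be run through the analyzer, so the input may need to be cleaned and reversed on the client side before it is sent.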

(Venkata Sreekanth Bhagavatula) #3

I will try that and get back to you

(Venkata Sreekanth Bhagavatula) #4

Is there a way to perform substring search? The reverse filter mentioned above is definitely a good idea, but sometimes the original string in the index looks like

1812abcdefg-pby-reel, 1812abcdefg123, etc. In this scenario the reverse filter may not work. I do use a char replace filter to remove non-word characters.

(Zachary Tong) #5

You can also use ngrams / shingles with the reverse approach. So then you'll be indexing reversed fragments, which will match the prefix query. Basically the exact same analyzer you have now, except add a reverse filter to it.

Most people who implement this strategy also index the forward direction, so that they get both prefix and suffix search.
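To make the combination concrete, here is a rough Python simulation (no Elasticsearch involved) of indexing character n-grams of the reversed string and then running a prefix lookup with the reversed input. The gram lengths and all helper names are assumptions chosen for the example.

```python
import re


def clean(text: str) -> str:
    """Mimic the char replace + lowercase steps of the existing analyzer."""
    return re.sub(r"[^A-Za-z0-9]", "", text).lower()


def ngrams(token: str, min_gram: int, max_gram: int) -> set[str]:
    """All character n-grams of `token` with length min_gram..max_gram."""
    return {
        token[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(token) - n + 1)
    }


def reversed_fragments(text: str, min_gram: int = 3, max_gram: int = 20) -> set[str]:
    """N-grams of the cleaned, reversed string (the 'reverse' direction)."""
    return ngrams(clean(text)[::-1], min_gram, max_gram)


def prefix_hit(fragments: set[str], user_input: str) -> bool:
    """True if any indexed fragment starts with the reversed input."""
    prefix = clean(user_input)[::-1]
    return any(f.startswith(prefix) for f in fragments)


for original in ["1812abcdefg-pby-reel", "1812abcdefg123", "812c"]:
    print(original, prefix_hit(reversed_fragments(original), "812abcdefg"))
```

The forward direction would use the same `ngrams` helper on `clean(text)` without the reversal, giving prefix-style matches alongside the suffix-style ones.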

(Venkata Sreekanth Bhagavatula) #6

Ngrams in reverse are likely to score other tokens higher. Shingles are not an option since the terms are like part numbers and contain no stop words.

I hope there is some way to search substrings without using regex or wildcard. Also, I am not able to embed a regex or wildcard query in a bool query, in the sense that it has no effect.

(Russ Cam) #7

You don't need stop words to use shingles
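For illustration, shingles are just combinations of adjacent tokens, so they can be built from any token stream. The sketch below is plain Python rather than the Elasticsearch `shingle` filter, and splitting on non-word characters is an assumption about how a part number might be tokenized.

```python
import re


def shingles(text: str, size: int = 2) -> list[str]:
    """Join each run of `size` adjacent tokens into one shingle."""
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t]
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


print(shingles("1812abcdefg-pby-reel"))  # ['1812abcdefg pby', 'pby reel']
```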

(Venkata Sreekanth Bhagavatula) #9

I am using ngram with different min and max gram values to do substring search.
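A setup along these lines might look like the sketch below. The gram sizes, the tokenizer and analyzer names, and the choice of a keyword-based search analyzer are all assumptions, not the values actually used here. Recent Elasticsearch versions limit the gap between `min_gram` and `max_gram` through the `index.max_ngram_diff` setting, so it is raised to match. The helper functions only simulate the matching locally.

```python
# Hypothetical values: names and gram sizes are assumptions, not the
# poster's actual configuration.
MIN_GRAM, MAX_GRAM = 3, 10

settings = {
    "index": {
        # Allow max_gram - min_gram to exceed the default limit.
        "max_ngram_diff": MAX_GRAM - MIN_GRAM,
        "analysis": {
            "tokenizer": {
                "substring_tokenizer": {
                    "type": "ngram",
                    "min_gram": MIN_GRAM,
                    "max_gram": MAX_GRAM,
                }
            },
            "analyzer": {
                # Index time: break each value into n-grams.
                "substring_index": {
                    "type": "custom",
                    "tokenizer": "substring_tokenizer",
                    "filter": ["lowercase"],
                },
                # Search time: keep the user input as one lowercase term.
                "substring_search": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase"],
                },
            },
        },
    }
}


def index_terms(text: str) -> set[str]:
    """Local stand-in for the index analyzer: all n-grams in the size range."""
    token = text.lower()
    return {
        token[i:i + n]
        for n in range(MIN_GRAM, MAX_GRAM + 1)
        for i in range(len(token) - n + 1)
    }


def matches(indexed: str, user_input: str) -> bool:
    """True if the whole (lowercased) input is one of the indexed n-grams."""
    return user_input.lower() in index_terms(indexed)


print(matches("1812abcdefg", "812abcdefg"))
print(matches("812c", "812abcdefg"))
```

With this arrangement, an input longer than `MAX_GRAM` characters can never equal an indexed gram, so the upper bound has to cover the longest query expected.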
