Hello Elastic Community,
I am currently working on a challenging problem where I need to find matches for an input string based on the longest common n-grams from a very large repository of indexed strings (around a billion entries). I need a search query that prioritizes results based on contiguous word sequences (n-grams) that appear in both the indexed strings and a new input query string.
For example, suppose I have the following strings indexed:
- "Blue house"
- "A nice blue house"
- "I want to buy a nice blue house"
- "house blue buy want I nice a" <- just to illustrate that I don't want to match tokens in random order, only n-grams
Given a query string such as "I do not want to buy a nice blue house", the ideal return would be "I want to buy a nice blue house", as it contains the longest n-gram match "want to buy a nice blue house" from the indexed strings.
However, I am not experienced with Elasticsearch and I am struggling with the proper configuration:
- I tried using a `match_phrase` query (first sketch below), but it only matches when the entire input string appears as a phrase in an indexed document, which is not what I need.
- I am aware of the n-gram tokenizer, but I am unsure how to configure and use it effectively, both when indexing the strings and when querying the index (second sketch below).
Any help here? Is there a better approach or tool I should consider given the volume and nature of the data?
Any insights or suggestions would be greatly appreciated!
Thank you!