How to find matches based on Longest Common N-gram

Hello Elastic Community,

I am currently working on a challenging problem: given an input string, I need to find matches based on the longest common n-grams over a very large repository of indexed strings (around a billion entries). In other words, I need a search query that ranks results by the longest contiguous run of tokens shared between an indexed string and a new input query string.

For example, suppose I have the following strings indexed:

  • "Blue house"
  • "A nice blue house"
  • "I want to buy a nice blue house"
  • "house blue buy want I nice a" <- just to illustrate that I don't want to match tokens in random order, only n-grams

Given a query string such as "I do not want to buy a nice blue house", the ideal result would be "I want to buy a nice blue house", since it shares the longest contiguous n-gram with the query: "want to buy a nice blue house".
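
For concreteness, here is roughly how I index these strings today with the Python client; the index name "phrases" and the field name "content" are just placeholders for my actual setup:

```python
from elasticsearch import Elasticsearch

# Placeholder connection and names, only to make the example reproducible.
es = Elasticsearch("http://localhost:9200")

docs = [
    "Blue house",
    "A nice blue house",
    "I want to buy a nice blue house",
    "house blue buy want I nice a",
]
for text in docs:
    es.index(index="phrases", document={"content": text})

# Make the documents searchable before querying.
es.indices.refresh(index="phrases")
```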

However, I am not very experienced with Elasticsearch and I am struggling to find the right configuration:

  • I tried using a match_phrase query, but it only checks whether the entire input string appears as a contiguous phrase in the indexed documents, which is not what I need (see the first sketch below).
  • I am aware of the n-gram tokenizer, but I am unsure how to configure it when indexing the strings and how to use it when querying the index (second sketch below).
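
Here is a sketch of the match_phrase attempt (same placeholder names as above). It returns no hits for the example query, because no indexed string contains the whole input as one contiguous phrase:

```python
# My match_phrase attempt: it only matches documents that contain the
# entire query string as one contiguous phrase, so nothing is returned
# for the example input.
response = es.search(
    index="phrases",
    query={
        "match_phrase": {
            "content": "I do not want to buy a nice blue house"
        }
    },
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["content"])
```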
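
And this is as far as I got with the n-gram tokenizer, pieced together from the docs. The gram sizes are guesses, and as far as I can tell this tokenizer produces character n-grams rather than word n-grams, so I am not even sure it is the right building block:

```python
# A guess at an n-gram analyzer, not a working configuration.
# The ngram tokenizer splits text into character n-grams of the
# configured size; "whitespace" in token_chars lets grams span word
# boundaries.
es.indices.create(
    index="phrases-ngram",
    settings={
        "analysis": {
            "tokenizer": {
                "my_ngram_tokenizer": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 3,
                    "token_chars": ["letter", "digit", "whitespace"],
                }
            },
            "analyzer": {
                "my_ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "my_ngram_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    mappings={
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "my_ngram_analyzer",
            }
        }
    },
)
```

With a mapping like this, a plain match query seems to score on how many grams overlap, but I don't see how to make it prefer the longest contiguous match, which is the part I am stuck on.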

Any help here? Is there a better approach or tool I should consider given the volume and nature of the data?

Any insights or suggestions would be greatly appreciated!

Thank you!