How to find matches based on Longest Common N-gram

Hello Elastic Community,

I am currently working on a challenging problem: finding matches for an input string based on the longest common n-grams, against a very large repository of indexed strings (around a billion entries). In other words, I need a search query that ranks results by the longest contiguous sequence of tokens shared between an indexed string and a new input query string.

For example, suppose I have the following strings indexed:

  • "Blue house"
  • "A nice blue house"
  • "I want to buy a nice blue house"
  • "house blue buy want I nice a" <- just to illustrate that I don't want to match tokens in random order, only n-grams

Given a query string such as "I do not want to buy a nice blue house", the ideal result would be "I want to buy a nice blue house", since it shares the longest contiguous n-gram with the query: "want to buy a nice blue house".
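For concreteness, here is roughly how the sample strings are indexed (the index name `strings` and the field name `text` are placeholders I made up for this post):

```
# "strings" and "text" are placeholder names.
POST strings/_bulk
{ "index": {} }
{ "text": "Blue house" }
{ "index": {} }
{ "text": "A nice blue house" }
{ "index": {} }
{ "text": "I want to buy a nice blue house" }
{ "index": {} }
{ "text": "house blue buy want I nice a" }
```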

However, I am not experienced with Elasticsearch, and I am struggling to find the proper configuration:

  • I tried a match_phrase query (see the first sketch after this list), but it only matches documents that contain the entire input string as a phrase, which is not what I need.
  • I am aware of the n-gram tokenizer, but I am unsure how to configure and use it effectively, both when indexing the strings and when querying the index (my rough attempt is the second sketch below).
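For reference, this is the kind of match_phrase query I tried, using the same placeholder names as above. It returns nothing here, because no indexed document contains the full query string as a phrase:

```
GET strings/_search
{
  "query": {
    "match_phrase": {
      "text": "I do not want to buy a nice blue house"
    }
  }
}
```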
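And this is my rough understanding of how an n-gram analyzer might be configured at index time; all the names are placeholders, and I am not confident the min_gram/max_gram settings are sensible. Since the ngram tokenizer produces character n-grams, I also wonder whether the shingle token filter (word-level n-grams) is closer to what I actually need:

```
# All names ("strings", "text", "my_ngram_*") are placeholders.
# "whitespace" is kept in token_chars so n-grams can span word boundaries.
PUT strings
{
  "settings": {
    "index": { "max_ngram_diff": 4 },
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 7,
          "token_chars": [ "letter", "digit", "whitespace" ]
        }
      },
      "analyzer": {
        "my_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": { "type": "text", "analyzer": "my_ngram_analyzer" }
    }
  }
}
```

My worry is that with a billion entries, character n-grams like this would inflate the index enormously, which is part of why I am asking whether there is a better approach.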

Any help here? In particular, is there a better approach or tool I should consider, given the volume and nature of the data?

Any insights or suggestions would be greatly appreciated!

Thank you!
