How to find matches based on Longest Common N-gram

Hello Elastic Community,

I am currently working on a challenging problem: given an input string, I need to find matches based on the longest common n-grams over a very large repository of indexed strings (around a billion entries). In other words, I need a search query that ranks results by the longest contiguous run of tokens shared between an indexed string and a new input query string.

For example, suppose I have the following strings indexed:

  • "Blue house"
  • "A nice blue house"
  • "I want to buy a nice blue house"
  • "house blue buy want I nice a" <- just to illustrate that I don't want to match tokens in random order, only n-grams

Given a query string such as "I do not want to buy a nice blue house", the ideal result would be "I want to buy a nice blue house", since it shares the longest contiguous n-gram with the query: "want to buy a nice blue house".
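
For concreteness, here is roughly how I index these strings today with the Python client; the index name "phrases" and the field name "content" are just placeholders for my actual setup:

```python
from elasticsearch import Elasticsearch

# Placeholder connection and names, only to make the example reproducible.
es = Elasticsearch("http://localhost:9200")

docs = [
    "Blue house",
    "A nice blue house",
    "I want to buy a nice blue house",
    "house blue buy want I nice a",
]
for text in docs:
    es.index(index="phrases", document={"content": text})

# Make the documents searchable before querying.
es.indices.refresh(index="phrases")
```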

However, I am not very experienced with Elasticsearch and I am struggling to find the right configuration:

  • I tried using a match_phrase query, but it only checks whether the entire input string appears as a contiguous phrase in the indexed documents, which is not what I need (see the first sketch below).
  • I am aware of the n-gram tokenizer, but I am unsure how to configure it when indexing the strings and how to use it when querying the index (second sketch below).
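
Here is a sketch of the match_phrase attempt (same placeholder names as above). It returns no hits for the example query, because no indexed string contains the whole input as one contiguous phrase:

```python
# My match_phrase attempt: it only matches documents that contain the
# entire query string as one contiguous phrase, so nothing is returned
# for the example input.
response = es.search(
    index="phrases",
    query={
        "match_phrase": {
            "content": "I do not want to buy a nice blue house"
        }
    },
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["content"])
```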
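
And this is as far as I got with the n-gram tokenizer, pieced together from the docs. The gram sizes are guesses, and as far as I can tell this tokenizer produces character n-grams rather than word n-grams, so I am not even sure it is the right building block:

```python
# A guess at an n-gram analyzer, not a working configuration.
# The ngram tokenizer splits text into character n-grams of the
# configured size; "whitespace" in token_chars lets grams span word
# boundaries.
es.indices.create(
    index="phrases-ngram",
    settings={
        "analysis": {
            "tokenizer": {
                "my_ngram_tokenizer": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 3,
                    "token_chars": ["letter", "digit", "whitespace"],
                }
            },
            "analyzer": {
                "my_ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "my_ngram_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    mappings={
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "my_ngram_analyzer",
            }
        }
    },
)
```

With a mapping like this, a plain match query seems to score on how many grams overlap, but I don't see how to make it prefer the longest contiguous match, which is the part I am stuck on.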

Any help here? Is there a better approach or tool I should consider given the volume and nature of the data?

Any insights or suggestions would be greatly appreciated!

Thank you!