Two way substring search

Andr · April 28, 2023, 11:17am

Hello!

I would like to implement a search that returns a document if the query string is a substring of the document field, or if the document field is a substring of the query string.
For example, I have a document with a text field {"num": "123456"}.
I would like to get it when the search query is one of: "123456", "2345", "01234567"
I would not like to get it when the search query is one of: "1234456", "23455", "012345667"
Is there a way to implement this behavior?

Thanks!

warkolm · April 30, 2023, 11:04pm

Welcome to our community!

You could try ngrams, though they are expensive because of the extra tokenisation involved.

Christian_Dahlqvist · May 1, 2023, 10:04am

As far as I know there is no easy and efficient way to do this. ngrams or wildcard search would find matches one way (when the search string is part of the indexed string). The opposite is trickier and will likely need to be handled separately and require one or more query clauses combined in a boolean clause. You may be able to use one or more ngram query time analysers to break up the query string into substrings and do an exact match for each against the indexed strings, but I think it would quickly become messy. I also suspect this might quickly become expensive for longer strings.

What is the minimum and maximum length of the strings you have indexed and are searching with?
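
To make the one-way limitation concrete, here is a minimal pure-Python sketch (it does not talk to Elasticsearch). It assumes 3-gram tokens where every query token must be present in the indexed value, and then handles the opposite direction by enumerating the query's substrings for exact comparison. The function names and the `min_len` parameter are illustrative assumptions, not part of any Elasticsearch API.

```python
def ngram_tokens(text, n=3):
    """All character n-grams of text, as a set."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_and_match(indexed, query, n=3):
    """Simulates matching on an n-gram field when all query n-grams are required.

    Works when the query is contained in the indexed value; a query
    shorter than n produces no tokens and therefore never matches.
    """
    query_tokens = ngram_tokens(query, n)
    return bool(query_tokens) and query_tokens <= ngram_tokens(indexed, n)


def query_substrings(query, min_len=3):
    """Every substring of the query with at least min_len characters."""
    return {
        query[i:j]
        for i in range(len(query))
        for j in range(i + min_len, len(query) + 1)
    }


def reverse_match(indexed, query, min_len=3):
    """Opposite direction: the whole indexed value occurs inside the query."""
    return indexed in query_substrings(query, min_len)


if __name__ == "__main__":
    doc = "123456"
    for q in ["123456", "2345", "01234567", "1234456", "23455"]:
        print(q, ngram_and_match(doc, q), reverse_match(doc, q))
```

With `min_len=3`, an 8-character query with distinct characters yields 21 candidate substrings and a 17-character one yields up to 120, which hints at why this gets expensive as strings grow.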

Andr · May 1, 2023, 10:48am

Thanks! It's a phone number field, so the maximum length should be up to 17 symbols. I already have ngram (3-gram) tokenization, and it works fine when the query string is no longer than the document field (for example, the doc is {"num": "123456"} and the queries are "123" and "123456"). The problem is mainly due to the country code: indexed docs and queries may come with or without it. When I have the query string "+4930901820", I would like to get the doc with the field "30901820".

Christian_Dahlqvist · May 1, 2023, 10:54am

Context helps as it sounds like you are not just looking for any substrings, but rather an exact match for a substring originating from the back. Is that correct?

Andr · May 1, 2023, 11:10am

Yes. For query string "+4930901820" I would like to get doc with the field "30901820". So the whole field "30901820" is a substring of the query "+4930901820"
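
Since the match is anchored at the end, one option is to generate the candidate suffixes of the query on the client side and look them up with an exact `terms` query. The sketch below assumes an untokenized keyword sub-field, here hypothetically called `num.raw`, that stored numbers contain digits only, and that a minimum suffix length of 6 is acceptable; all of these are assumptions rather than details from the thread.

```python
import string


def normalize(number):
    """Keep ASCII digits only, dropping '+', spaces and punctuation."""
    return "".join(ch for ch in number if ch in string.digits)


def suffix_candidates(query, min_len=6):
    """Suffixes of the normalized query, longest first.

    The full number is always included, even when it is shorter
    than min_len; an input without digits yields no candidates.
    """
    digits = normalize(query)
    if not digits:
        return []
    last_start = max(len(digits) - min_len, 0)
    return [digits[i:] for i in range(last_start + 1)]


def build_terms_query(query, field="num.raw", min_len=6):
    """Search body that matches documents whose field equals any suffix.

    'num.raw' is a hypothetical keyword sub-field name.
    """
    return {"query": {"terms": {field: suffix_candidates(query, min_len)}}}


if __name__ == "__main__":
    print(suffix_candidates("+4930901820"))
    print(build_terms_query("+4930901820"))
```

For a phone number of at most 17 digits this produces at most 12 terms with `min_len=6`, so the query stays small compared with enumerating every substring.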

Christian_Dahlqvist · May 1, 2023, 11:11am

Does the match need to be anchored at the end, or should "3090182" match as well?

Andr · May 1, 2023, 11:24am

Currently, the main goal is to find "30901820" as performantly as possible.

Christian_Dahlqvist · May 1, 2023, 5:59pm

I have not had time to experiment with it and am not sure if/when I will, but wanted to share some initial ideas.

One way of doing this might be to map the main field as keyword and have two multi-fields with custom mappings, one for each type of search (longer search string vs shorter indexed string and shorter search string vs longer indexed string). This means that you most likely would need a boolean query with two should clauses, one for each scenario.

For the scenario where you have a shorter search string and want to match to the same string or something that ends with the full search string, you can probably use a multi-field with wildcard mapping. You could use the keyword field as well with a wildcard query, but as you would have a leading wildcard this would be quite slow.

For the other scenario you might be able to create a custom analyser with a reverse filter followed by an edge ngram token filter. Am not sure which query type would work best with that, so you may need to experiment.
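
An untested sketch of how these ideas might fit together, written as Python dicts that could be sent as the index-creation and search request bodies. The sub-field names (`wc`, `rev`), the analyzer and filter names, the gram lengths 6 to 17, and the choice to apply the edge ngram filter only at search time with a `match` query are all assumptions. The small functions at the end only simulate the intended matching logic in plain Python.

```python
# Index body: keyword main field plus two multi-fields (all names hypothetical).
INDEX_BODY = {
    "settings": {"analysis": {
        "filter": {
            "num_suffix_grams": {"type": "edge_ngram", "min_gram": 6, "max_gram": 17},
        },
        "analyzer": {
            # Index time: store the whole number reversed as a single token.
            "num_reversed": {
                "type": "custom", "tokenizer": "keyword", "filter": ["reverse"],
            },
            # Search time: reverse the query, then emit its prefixes,
            # which correspond to suffixes of the original query.
            "num_reversed_edge": {
                "type": "custom", "tokenizer": "keyword",
                "filter": ["reverse", "num_suffix_grams"],
            },
        },
    }},
    "mappings": {"properties": {"num": {
        "type": "keyword",
        "fields": {
            "wc": {"type": "wildcard"},
            "rev": {
                "type": "text",
                "analyzer": "num_reversed",
                "search_analyzer": "num_reversed_edge",
            },
        },
    }}},
}


def build_query(q):
    """Boolean query with one should clause per scenario."""
    return {"query": {"bool": {
        "should": [
            # Shorter query: the indexed value ends with the query.
            {"wildcard": {"num.wc": {"value": "*" + q}}},
            # Longer query: the indexed value is a suffix of the query.
            {"match": {"num.rev": {"query": q}}},
        ],
        "minimum_should_match": 1,
    }}}


def reverse_edge_ngrams(text, min_gram=6, max_gram=17):
    """Plain-Python imitation of the reverse + edge ngram filter chain."""
    rev = text[::-1]
    return [rev[:n] for n in range(min_gram, min(max_gram, len(rev)) + 1)]


def simulated_match(indexed, query):
    """What the two should clauses are intended to accept."""
    ends_with_query = indexed.endswith(query)
    is_suffix_of_query = indexed[::-1] in reverse_edge_ngrams(query)
    return ends_with_query or is_suffix_of_query
```

In this simulation, `simulated_match("30901820", "4930901820")` and `simulated_match("4930901820", "30901820")` both hold, while a number that differs in its last digit does not match. Note that with `min_gram` set to 6, indexed values shorter than six characters would never match through the second clause, and a leading "+" would need to be stripped from the query beforehand.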

Andr · May 2, 2023, 12:28pm

Thank you for the ideas! I'll try them.

