Two way substring search

Hello!

I would like to implement a search that returns a document if the query string is a substring of the document field, or the document field is a substring of the query string.
For example, I have a document with a text field {"num": "123456"}.
I would like to get it if search queries are: "123456", "2345", "01234567"
I wouldn't like to get it if search queries are: "1234456", "23455", "012345667"
Are there any ways to implement such a behavior?

Thanks!

Welcome to our community! :smiley:

You could try ngrams, although they can be expensive because of the extra tokens the tokenisation generates.

As far as I know there is no easy and efficient way to do this. ngrams or wildcard search would find matches one way (when the search string is part of the indexed string). The opposite is trickier and will likely need to be handled separately and require one or more query clauses combined in a boolean clause. You may be able to use one or more ngram query time analysers to break up the query string into substrings and do an exact match for each against the indexed strings, but I think it would quickly become messy. I also suspect this might quickly become expensive for longer strings.
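To make the cost of the query-time approach concrete, here is a minimal Python sketch of the idea. It is not Elasticsearch code: a plain set stands in for the exact values in an index, the minimum substring length of 3 is an assumption chosen to mirror a 3-gram setup, and `query_substrings` and `matches_reverse` are hypothetical names.

```python
def query_substrings(query: str, min_len: int = 3) -> set:
    """Return every contiguous substring of `query` with at least `min_len` characters."""
    n = len(query)
    return {
        query[start:end]
        for start in range(n)
        for end in range(start + min_len, n + 1)
    }


def matches_reverse(query: str, indexed_values: set, min_len: int = 3) -> set:
    """Return the indexed values that occur, whole, somewhere inside `query`."""
    # Each candidate substring is compared exactly, like a term lookup.
    return query_substrings(query, min_len) & indexed_values


# Stand-in for exact (keyword) values stored in an index.
indexed = {"123456", "30901820"}

print(matches_reverse("01234567", indexed))
print(matches_reverse("+4930901820", indexed))
print(len(query_substrings("abcdefghijklmnopq")))  # 17 distinct characters
```

For a 17-character query and a minimum length of 3, this produces up to 120 candidate terms, which illustrates why the approach gets expensive as strings grow.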

What is the minimum and maximum length of the strings you have indexed and are searching with?

Thanks! It's a phone number field, so the maximum length should be up to 17 characters. I already have ngram (3-gram) tokenization, and it works fine when the query string is shorter than the document field (for example, the doc is {"num": "123456"} and the queries are "123" and "123456"). The problem is mainly due to the country code: indexed docs and queries can be with or without it. When I have the query string "+4930901820", I would like to get the doc with the field "30901820".

Context helps as it sounds like you are not just looking for any substrings, but rather an exact match for a substring originating from the back. Is that correct?

Yes. For the query string "+4930901820" I would like to get the doc with the field "30901820". So the whole field "30901820" is a substring of the query "+4930901820".

Does the match need to be anchored at the end, or should "3090182" match as well?

Currently, the main goal is to find "30901820" as efficiently as possible.

I have not had time to experiment with it and am not sure if/when I will, but wanted to share some initial ideas.

One way of doing this might be to map the main field as keyword and have two multi-fields with custom mappings, one for each type of search (longer search string vs shorter indexed string and shorter search string vs longer indexed string). This means that you most likely would need a boolean query with two should clauses, one for each scenario.
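As a rough sketch of that query shape, the Python snippet below builds a request body as a plain dict. The sub-field names `num.suffix_wc` and `num.rev` are hypothetical, as is the choice of a `match` query for the second clause; none of this has been tested against a cluster.

```python
import json


def build_two_way_query(search: str) -> dict:
    """Build a bool query with one should clause per search direction."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Shorter search string, longer (or equal) indexed value:
                    # the indexed value must end with the search string.
                    {"wildcard": {"num.suffix_wc": {"value": f"*{search}"}}},
                    # Longer search string, shorter indexed value:
                    # assumes `num.rev` is analysed so that suffixes of the
                    # search string are compared with the whole indexed value.
                    {"match": {"num.rev": {"query": search}}},
                ],
                # At least one of the two directions has to match.
                "minimum_should_match": 1,
            }
        }
    }


print(json.dumps(build_two_way_query("+4930901820"), indent=2))
```

Note that `*` and `?` are special characters in a wildcard query, so a search string containing them would need escaping before being passed to this helper.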

For the scenario where you have a shorter search string and want to match to the same string or something that ends with the full search string, you can probably use a multi-field with wildcard mapping. You could use the keyword field as well with a wildcard query, but as you would have a leading wildcard this would be quite slow.

For the other scenario you might be able to create a custom analyser with a reverse filter followed by an edge ngram token filter. I am not sure which query type would work best with that, so you may need to experiment.
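One way this could look, as an untested sketch: the index settings below are written as a Python dict, and a small pure-Python simulation shows which tokens the two analysers would be expected to produce. The analyser and filter names, the `min_gram` and `max_gram` values, and the idea of applying the edge ngram analyser only at search time (with a single reversed token at index time) are all assumptions, not something confirmed in this thread.

```python
# Hypothetical index settings: names and gram sizes are illustrative assumptions.
SETTINGS = {
    "analysis": {
        "filter": {
            "suffix_grams": {"type": "edge_ngram", "min_gram": 3, "max_gram": 20}
        },
        "analyzer": {
            # Index time: keep the whole value as one token, reversed.
            "reversed_whole": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["reverse"],
            },
            # Search time: reverse, then emit prefixes of the reversed string,
            # which correspond to suffixes of the original search string.
            "reversed_suffixes": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["reverse", "suffix_grams"],
            },
        },
    }
}


def index_tokens(value: str) -> set:
    """Simulate `reversed_whole`: a single reversed token."""
    return {value[::-1]}


def search_tokens(query: str, min_gram: int = 3, max_gram: int = 20) -> set:
    """Simulate `reversed_suffixes`: edge n-grams of the reversed query."""
    reversed_query = query[::-1]
    longest = min(max_gram, len(reversed_query))
    return {reversed_query[:size] for size in range(min_gram, longest + 1)}


def suffix_match(query: str, indexed_value: str) -> bool:
    """True if the whole indexed value is a suffix of the query (token overlap)."""
    return bool(index_tokens(indexed_value) & search_tokens(query))


print(suffix_match("+4930901820", "30901820"))  # indexed value ends the query
print(suffix_match("+4930901820", "3090182"))   # not anchored at the end
```

In this sketch an indexed value shorter than `min_gram` would never match, so the gram sizes would need to be chosen to fit the shortest and longest numbers actually stored.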

Thank you for the ideas! I'll try them.
