I want to do partial matching on text fields without using wildcards. (These fields also include Japanese text data, but more on that later; I want to nail the core functionality before getting into Japanese.)
Here are some example matches I need (search term --> [results]):
- "beatl" --> ["beatles", "beatl", ...]
- "ance" --> ["dance", "chance", ...]
- "bieb" --> ["Bieber", "Doobie Brothers", ...]
My experiments with wildcards:
- I made the fields non-analyzed and used wildcard queries.
- The performance was abysmal, since the queries I wrote included a leading asterisk (my data set is 70+ million documents).
- On top of that, I need to use multi_match over a set of text fields, so all in all wildcards did not work. (Roughly what I tried is sketched below.)
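For reference, this is roughly what the wildcard attempt looked like; artist_name and track_name are placeholder names for my actual fields, and since wildcard is a single-field query I had to spell out one clause per field:

"query": {
    "bool": {
        "should": [
            { "wildcard": { "artist_name": { "value": "*bieb*" } } },
            { "wildcard": { "track_name": { "value": "*bieb*" } } }
        ]
    }
}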
My experiments with n-grams:
- I set min_gram and max_gram to 3 and tested. Below is an example of the bad results I don't want to show up.
- "Beatl" --> ["atlas beaz feat", ...] (I can see why this showed up: it contains all the 3-grams from "Beatl". But I don't want such results. The query I'm running is sketched right after this list.)
Some thoughts on improving the n-gram search situation:
- I wonder if I can apply a post filter after receiving the results from the n-gram query; using wildcards there may not be as expensive.
- Can I give a higher score (or filter) to matches whose n-grams are overlapping/contiguous, so that results like "atlas beaz feat" do not show up? (A sketch of what I have in mind follows this list.)
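Putting both thoughts together, something like the request below is what I have in mind. The match_phrase clause only matches when the query's n-grams occur consecutively (i.e. the query is a real substring), so it should push contiguous matches up, and the wildcard sits in a post_filter in the hope that it only has to run over the n-gram candidates. Field names are placeholders, artist_name.raw stands for a not-analyzed sub-field I would have to add, and I have not verified that this is actually any cheaper:

"query": {
    "bool": {
        "must": [
            { "match": { "artist_name": { "query": "Beatl", "operator": "and" } } }
        ],
        "should": [
            { "match_phrase": { "artist_name": "Beatl" } }
        ]
    }
},
"post_filter": {
    "wildcard": { "artist_name.raw": { "value": "*beatl*" } }
}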
Is there any other way to make partial matching possible?
Here are my index settings with n-grams:
"settings":{
"analysis":{
"char_filter":{
"remove_spaces": {
"type": "pattern_replace",
"pattern": "(\\s)",
"replacement": ""
}
},
"tokenizer":{
"eng_ngram_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
],
"custom_token_chars": ["~","[","]","・"]
}
},
"filter":{
"custom_asciifolding":{
"type":"asciifolding",
"preserve_original":true
}
},
"analyzer":{
"eng_ngram_analyzer": {
"type": "custom",
"char_filter": [
"remove_spaces"
],
"tokenizer": "eng_ngram_tokenizer",
"filter": [
"lowercase",
"custom_asciifolding"
]
}
}
}
},
...
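For what it's worth, testing the analyzer with the _analyze API (my_index is a placeholder for my index name) shows what the searches actually work with; "Beatles" comes out as something like the 3-grams bea, eat, atl, tle, les, so any document sharing enough of those grams can come back:

GET my_index/_analyze
{
    "analyzer": "eng_ngram_analyzer",
    "text": "Beatles"
}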