Want to do partial match without using wildcards

I want to do partial matching on text fields without using wildcards. (These fields also contain Japanese text data, but more on that later; I want to get the core functionality right before tackling Japanese.)

Here are some example matches I need (query --> [results]):

  • "beatl" --> ["beatles", "beatl", ...]
  • "ance" --> ["dance", "chance", ...]
  • "bieb" --> ["Bieber", "Doobie Brothers", ...]

My experiments with wildcards:

  • I made the fields non-analyzed and used wildcard queries.
  • The performance was abysmal, since the queries I wrote included a leading asterisk (my data set is 70 million+ documents).
  • On top of that, I need to search over a set of text fields (multi_match style), so all in all, wildcards did not work. (A sketch of the kind of query I mean is just after this list.)
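For illustration, this is roughly the kind of query I was running (leading wildcard, several fields); "artist_name" and "album_name" are placeholder names, not my real fields:

# Hypothetical example – placeholder field names, leading wildcards as described above
GET my-index/_search
{
  "query": {
    "bool": {
      "should": [
        { "wildcard": { "artist_name": { "value": "*beatl*" } } },
        { "wildcard": { "album_name":  { "value": "*beatl*" } } }
      ],
      "minimum_should_match": 1
    }
  }
}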

My experiments with ngrams:

  • I set both min_gram and max_gram to 3 and tested. Below is an example of the bad results I don't want to show up.
  • "Beatl" --> ["atlas beaz feat", ...] (I can see why this showed up: it contains all of the 3-grams from "Beatl". But I don't want such results.)

Some thoughts on improving the n-gram search situation:

  • I wonder if I can apply a post filter after receiving the results from the n-gram query; using wildcards there may not be as expensive (see the sketch after this list).
  • Can I give a higher score (or apply filtering) to n-grams that overlap, so that results like "Atlas Beaz feat" do not show up?
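To make the post-filter idea concrete, here is a rough sketch of what I have in mind: let the n-gram query find candidates, and additionally require a substring match via a wildcard in the filter clause ("title.ngram" and "title.raw" are assumed sub-fields, not my real mapping). Whether this is actually cheaper than a plain wildcard query is exactly what I am unsure about.

# Sketch only – "title.ngram" (ngram-analyzed) and "title.raw" (keyword) are assumptions
GET my-index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title.ngram": "beatl" } }
      ],
      "filter": [
        { "wildcard": { "title.raw": { "value": "*beatl*", "case_insensitive": true } } }
      ]
    }
  }
}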

Is there any other way to make partial match possible?

Here are my index settings with n-grams:

"settings":{
      "analysis":{
        "char_filter":{
          "remove_spaces": {
            "type": "pattern_replace",
            "pattern": "(\\s)",
            "replacement": ""
          }
        },
        "tokenizer":{
          "eng_ngram_tokenizer": {
            "type": "ngram",
            "min_gram": 3,
            "max_gram": 3,
            "token_chars": [
              "letter",
              "digit",
              "punctuation",
              "symbol"
            ],
            "custom_token_chars": ["~","[","]","・"]
          }
        },
        "filter":{
          "custom_asciifolding":{
            "type":"asciifolding",
            "preserve_original":true
          }
        },
        "analyzer":{
          "eng_ngram_analyzer": {
            "type": "custom",
            "char_filter": [
              "remove_spaces"
            ],
            "tokenizer": "eng_ngram_tokenizer",
            "filter": [
              "lowercase",
              "custom_asciifolding"
            ]
          }
        }
      }
    },
...
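For completeness, the tokens the analyzer emits can be inspected with _analyze ("my-index" is a placeholder name). "Beatl" becomes the 3-grams bea, eat, atl, all of which also occur in "atlasbeazfeat" once spaces are removed, which is why the bad match above shows up.

# Placeholder index name; expected tokens: "bea", "eat", "atl"
GET my-index/_analyze
{
  "analyzer": "eng_ngram_analyzer",
  "text": "Beatl"
}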

Welcome!

When using ngrams, you want to apply them at index time but not at search time, i.e. not on the query.

For that you need to tell Elasticsearch to use a simple analyzer at search time. See search_analyzer | Elasticsearch Guide [8.14] | Elastic
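Something along these lines, reusing your eng_ngram_analyzer at index time and a plain analyzer for the query side ("title" is just an example field name):

# Sketch – "title" is a placeholder field; the analyzers must exist in the index settings
PUT my-index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "eng_ngram_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}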


@dadoonet Thanks for your response. Appreciate it :pray: !

So I tried using my custom ngram analyzer at index time, and a standard-tokenizer analyzer (with the same filters I apply in the index-time analyzer) at search time.

**This is my search-time analyzer:**

...
          "eng_search_analyzer": {
            "type": "custom",
            "char_filter": [
              "remove_spaces"
            ],
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "custom_asciifolding"
            ]
          }
...

(Refer to the question for the definitions of filter and char_filter)

But I got no results. My understanding is that this is because the search query is no longer tokenized into n-grams, so a query term like "beatl" never matches the 3-gram tokens in the index; I would need to increase max_gram to a higher value to cover whole query terms.

I somehow feel it is not a good choice to do something like min_gram: 2 and max_gram: 20 on my 70-million-document data set, but please advise me if this is a commonly accepted practice.
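For reference, this is roughly what that change would look like; note that index.max_ngram_diff would also need to be raised, since by default max_gram may only exceed min_gram by 1:

# Sketch only – wide grams greatly increase the number of indexed tokens
PUT my-index
{
  "settings": {
    "index": { "max_ngram_diff": 18 },
    "analysis": {
      "tokenizer": {
        "eng_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [ "letter", "digit", "punctuation", "symbol" ]
        }
      }
    }
  }
}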

Thanks,
Pratik

Elasticsearch has a special field type that uses ngrams to accelerate wildcard and regex queries: Find strings within strings faster with the Elasticsearch wildcard field | Elastic Blog
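For example, something like this ("title" is an illustrative field name): the field is mapped as type "wildcard" and then queried with a normal wildcard query.

# Sketch – "title" is a placeholder; the wildcard field keeps leading-wildcard queries reasonably fast
PUT my-index
{
  "mappings": {
    "properties": {
      "title": { "type": "wildcard" }
    }
  }
}

GET my-index/_search
{
  "query": {
    "wildcard": { "title": { "value": "*beatl*", "case_insensitive": true } }
  }
}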


@Mark_Harwood1 This looks promising. I will try this out.
Thanks a lot.

If you don't succeed, the best thing to do is to create a full reproduction script so we can play with it in Kibana and propose concrete fixes. :slight_smile: