Setting min=max in ngram tokenizers

The docs suggest setting min=max on ngram tokenizers, but this results in a very poor search experience. Assuming (3,3) and searching for "rough", you will never find "trough". Only if you search for "tro", "rou" or some other 3-character string will you find anything.

It usually makes sense to set min_gram and max_gram to the same value. The smaller the length, the more documents will match but the lower the quality of the matches. The longer the length, the more specific the matches. A tri-gram (length 3) is a good place to start.

IMHO these values should not be equal but defined as follows:

  • min: The shortest string for which you want to deliver results (typically 3, sometimes 2 depending on the use case)
  • max: The longest string for which you want to deliver results (typically 10, maybe as high as 20 depending on how your users search); see the sketch below
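
For example, a sketch of such a config (the index and analyzer names are made up for illustration; note that once min and max differ by more than 1 you must also raise index.max_ngram_diff, which defaults to 1):

PUT /myindex
{
  "settings": {
    "index": {
      "max_ngram_diff": 7,
      "analysis": {
        "tokenizer": {
          "my_ngram_tok": {
            "type": "ngram",
            "min_gram": 3,
            "max_gram": 10,
            "token_chars": ["letter", "digit"]
          }
        },
        "analyzer": {
          "my_ngram": {
            "tokenizer": "my_ngram_tok",
            "filter": ["lowercase"]
          }
        }
      }
    }
  }
}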

I feel the tip does not explain its reasoning and leads users to frustration, since (3,3) is basically useless.

Run a simple test with the example you used and you will see that this is not how ngrams work. The ngram tokenisation is performed both at index and query time (unless you specify a different analyser to be used at search time), so the example you provided will match, as several trigrams will be in common.
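
For example, here is a minimal, self-contained sketch (the index name test, the field title and the analyzer names are made up for illustration) showing that a (3,3) index does match "rough" against "trough":

PUT /test
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "trigram_tok": {
            "type": "ngram",
            "min_gram": 3,
            "max_gram": 3,
            "token_chars": ["letter", "digit"]
          }
        },
        "analyzer": {
          "trigram": {
            "tokenizer": "trigram_tok",
            "filter": ["lowercase"]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "trigram"
      }
    }
  }
}

PUT /test/_doc/1?refresh
{ "title": "Trough" }

GET /test/_search
{ "query": { "match": { "title": "rough" } } }

The match query analyses "rough" with the same trigram analyser, producing rou, oug and ugh, all of which were indexed for "Trough", so the document is returned.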

I have of course tested this: (3,3) returns zero results for "rough" whilst (2,20) returns matches.

Using /_analyze it's easy to see why (3,3) isn't working:

POST /_analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 3,
    "max_gram": 3,
    "token_chars": [
      "letter",
      "digit"
    ]
  },
  "filter": [
    "lowercase"
  ],
  "text": "Trough"
}

tro
rou
oug
ugh

And (2,20):

POST /_analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 2,
    "max_gram": 20,
    "token_chars": [
      "letter",
      "digit"
    ]
  },
  "filter": [
    "lowercase"
  ],
  "text": "Trough"
}

tr
tro
trou
troug
trough
ro
rou
roug
rough
ou
oug
ough
ug
ugh
gh

Only with (2,20) do we see "rough" in the inverted index.

However, I do understand your point and would be interested in how to get it to work. We tested with POST /_search?q=fulltext.en:rough and got zero results. Somehow the tokenizer is not being applied at search time.

But to be honest, we don't want a user's search for "rough" to return the same results as "rou", so I'm not sure this is a route we want to take.

With a (3,3) ngram, both rough and trough will generate the tokens rou, oug and ugh (trough additionally generates tro) both at index and query time, which means they will match.
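
You can verify this with the same /_analyze call as above, this time for "rough":

POST /_analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 3,
    "max_gram": 3,
    "token_chars": [
      "letter",
      "digit"
    ]
  },
  "filter": [
    "lowercase"
  ],
  "text": "rough"
}

which should return:

rou
oug
ugh

all of which are among the trigrams produced for "Trough".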

What is the mapping for that field?

You can to some extent control this by setting minimum_should_match. If a search string generates multiple tokens, you can specify what portion of these must match for a document to be returned. This would give different result sets for the search strings rou and rough.
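
A sketch of such a query (the index name myindex and the 80% threshold are placeholders to adjust):

GET /myindex/_search
{
  "query": {
    "match": {
      "fulltext.en": {
        "query": "rough",
        "minimum_should_match": "80%"
      }
    }
  }
}

With (3,3), rough generates three trigrams (rou, oug, ugh), so 80% rounds down to requiring two of them, whereas a plain OR match needs only one.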

{
  ...
  "mappings": {
    "dynamic": "strict",
    "properties": {
     ...
      "fulltext": {
        "properties": {
          "de": {
            "analyzer": "ngram2",
            "search_analyzer": "whitesp",
            "type": "text"
          },
          "en": {
            "analyzer": "ngram2",
            "search_analyzer": "whitesp",
            "type": "text"
          }
        }
      },
      ...
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "ngram2": {
            "filter": [
              "ngram2",
              "lowercase"
            ],
            "tokenizer": "whitespace"
          },
          "whitesp": {
            "filter": [
              "lowercase"
            ],
            "tokenizer": "whitespace"
          }
        },
        "filter": {
          "ngram2": {
            "max_gram": "20",
            "min_gram": "2",
            "type": "ngram"
          }
        }
      },
      "number_of_replicas": "1",
      "number_of_shards": "1",
      "max_ngram_diff": 18
    }
  }
}

You have a separate search analyser (whitesp) that does not generate ngrams, which explains why you do not get any match. If you remove the search_analyzer setting you will get a match, as the ngram analyser will then be used for both indexing and searching.
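
i.e. change each field to something like this (snippet only):

"en": {
  "analyzer": "ngram2",
  "type": "text"
}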
