minimum_should_match does not work in ES 5.x

I deployed ES 5.x and used ngrams to split compound words, following https://www.elastic.co/guide/en/elasticsearch/guide/current/ngrams-compound-words.html.
The index settings and mappings from that page are:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "trigrams_filter": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3
        }
      },
      "analyzer": {
        "trigrams": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "trigrams_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "trigrams"
        }
      }
    }
  }
}
Then I indexed the data from the same page:
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "text": "Aussprachewörterbuch" }
{ "index": { "_id": 2 }}
{ "text": "Militärgeschichte" }
{ "index": { "_id": 3 }}
{ "text": "Weißkopfseeadler" }
{ "index": { "_id": 4 }}
{ "text": "Weltgesundheitsorganisation" }
{ "index": { "_id": 5 }}
{ "text": "Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz" }

But minimum_should_match does not work when I execute the following match query:
GET /my_index/my_type/_search
{
"query": {
"match": {
"text": {
"query": "Gesundheit",
"minimum_should_match": "80%"
}
}
}
}
The result still matches “Militär-ges-chichte” and “Rindfleischetikettierungsüberwachungsaufgabenübertragungs-ges-etz”, both of which also contain the trigram ges. Any ideas on how to handle this? Thanks.

You can use the validate-query API to figure out why this happens.

GET /my_index/_validate/query?explain
{
  "query": {
    "match": {
      "text": {
        "query": "Gesundheit",
        "minimum_should_match": "80%"
      }
    }
  }
}

returns:

"explanation": "Synonym(text:dhe text:eit text:esu text:ges text:hei text:ndh text:sun text:und)"

In other words, all the trigrams are considered to be synonyms and, as such, are treated as a single word when applying minimum_should_match. This is because you've used the ngram token filter, which emits all the ngrams of a single word at the same term position.
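To make the overlap concrete, here is a small sketch (not Elasticsearch code) that reproduces what the trigrams filter does to the query term with min_gram=3 / max_gram=3, and shows that the trigram ges also occurs in the other documents:

```python
# Sketch: character trigrams of a lowercased word, mirroring the
# ngram filter's min_gram=3 / max_gram=3 settings from the mapping.
def trigrams(word):
    w = word.lower()
    return [w[i:i + 3] for i in range(len(w) - 2)]

print(trigrams("Gesundheit"))
# ['ges', 'esu', 'sun', 'und', 'ndh', 'dhe', 'hei', 'eit']

# "ges" is also a trigram of the other matching documents, which is
# why they match once all query trigrams collapse into one synonym:
print("ges" in trigrams("Militärgeschichte"))  # True
```

Since the whole synonym clause matches as soon as any one of these trigrams matches, a single shared trigram like ges is enough to return a hit, regardless of the 80% setting.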

The internal Synonym query is a fairly recent addition which changed the interaction of minimum_should_match with tokens in the same position.

If you use the ngram tokenizer instead, then you will get what you want:

PUT /my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "trigrams_filter": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3
        }
      },
      "analyzer": {
        "trigrams": {
          "type": "custom",
          "tokenizer": "trigrams_filter",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "trigrams"
        }
      }
    }
  }
}
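The difference between the two mappings comes down to token positions. A sketch (plain Python, not Elasticsearch internals) of the contrast:

```python
# Sketch: an ngram *tokenizer* gives each trigram its own position,
# so each one becomes a separate optional clause for
# minimum_should_match; an ngram token *filter* re-emits every trigram
# at the position of the source token, collapsing them into one clause.
def tokenizer_positions(word):
    w = word.lower()
    return [(i, w[i:i + 3]) for i in range(len(w) - 2)]

def filter_positions(word):
    w = word.lower()
    # every trigram stacked at the single source-token position
    return [(0, w[i:i + 3]) for i in range(len(w) - 2)]

print(tokenizer_positions("Gesundheit")[:3])
# [(0, 'ges'), (1, 'esu'), (2, 'sun')]
print({pos for pos, _ in filter_positions("Gesundheit")})
# {0}
```

With distinct positions, the match query builds eight separate SHOULD clauses, and minimum_should_match can count how many of them actually matched.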

POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "text": "Aussprachewörterbuch" }
{ "index": { "_id": 2 }}
{ "text": "Militärgeschichte" }
{ "index": { "_id": 3 }}
{ "text": "Weißkopfseeadler" }
{ "index": { "_id": 4 }}
{ "text": "Weltgesundheitsorganisation" }
{ "index": { "_id": 5 }}
{ "text": "Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz" }

GET /my_index/_validate/query?explain
{
  "query": {
    "match": {
      "text": {
        "query": "Gesundheit",
        "minimum_should_match": "80%"
      }
    }
  }
}

returns:

"explanation": "(text:ges text:esu text:sun text:und text:ndh text:dhe text:hei text:eit)~6"

and the query:

GET /my_index/_search
{
  "query": {
    "match": {
      "text": {
        "query": "Gesundheit",
        "minimum_should_match": "80%"
      }
    }
  }
}

returns just:

    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "4",
        "_score": 4.2928576,
        "_source": {
          "text": "Weltgesundheitsorganisation"
        }
      }
    ]

Thanks for your reply.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.