Search: minimum_should_match: 100% not working when a single term expands to more terms

Hey,

I found an interesting behaviour for an OR-based query that has minimum_should_match: 100% set. Naively, I assumed this would behave the same as an AND query, but it does not.

Example:

DELETE test 

PUT test
{
  "mappings": {
    "properties": {
      "title" : {
        "type": "text",
        "analyzer": "my_delimiter_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_delimiter_analyzer": {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : ["lowercase", "word_delimiter"]
        }
      }
    }
  }
}

PUT test/_doc/1
{
  "title" : "chain 8mm yellow"
}

PUT test/_doc/2
{
  "title" : "chain mm yellow"
}

POST test/_refresh

Now let's run an analyze request

GET test/_analyze
{
  "field": "title", 
  "text": "chain 8mm yellow"
}

This returns the following, as expected because of the word_delimiter filter:

{
  "tokens" : [
    {
      "token" : "chain"
    },
    {
      "token" : "8"
    },
    {
      "token" : "mm"
    },
    {
      "token" : "yellow"
    }
  ]
}

Now, let's run a simple AND-based query, which returns one document:

GET test/_search
{
  "query": {
    "simple_query_string": {
      "default_operator": "AND",
      "fields": [
        "title"
      ],
      "query": "chain 8mm yellow"
    }
  }
}

Now, let's do the magic (or the bug?):

# returns both hits - but why? It is minimum_should_match: 100%?!
GET test/_search
{
  "query": {
    "simple_query_string": {
      "default_operator": "OR",
      "fields": [
        "title"
      ],
      "minimum_should_match": "100%",
      "query": "chain 8mm yellow"
    }
  }
}

This returns both documents, and when looking at the validate output it's also clear why:

GET test/_validate/query?explain=true
{
  "query": {
    "simple_query_string": {
      "default_operator": "OR",
      "fields": [
        "title"
      ],
      "minimum_should_match": "100%",
      "query": "chain 8mm yellow"
    }
  }
}

Output is:

{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "valid" : true,
  "explanations" : [
    {
      "index" : "test",
      "valid" : true,
      "explanation" : "(title:chain (title:8 title:mm) title:yellow)~3"
    }
  ]
}

So, basically, this query changes from the expected chain OR 8mm OR yellow with 100% matching to chain OR (8 OR mm) OR yellow, but without minimum_should_match: 100% being applied to the inner OR query.
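
To make the matching concrete, here is a minimal Python sketch of how a query shaped like `(title:chain (title:8 title:mm) title:yellow)~3` evaluates against the two documents. This is a toy model, not Lucene code: the `matches` helper and the tuple-based query representation are made up for illustration, and the token sets are hard-coded from the analyze output (doc 2's tokens are inferred the same way).

```python
# Toy model of a boolean query made only of SHOULD clauses (hypothetical, not Lucene).
# A query is either a term (str) or a tuple: (list_of_clauses, minimum_should_match).

def matches(query, doc_tokens):
    """Return True if the document's token set satisfies the query."""
    if isinstance(query, str):
        return query in doc_tokens
    clauses, min_should_match = query
    hits = sum(1 for clause in clauses if matches(clause, doc_tokens))
    return hits >= min_should_match

doc1 = {"chain", "8", "mm", "yellow"}  # "chain 8mm yellow"
doc2 = {"chain", "mm", "yellow"}       # "chain mm yellow"

# What the validate output shows: the inner OR needs only one match,
# and the ~3 applies to the three top-level clauses.
inner = (["8", "mm"], 1)
actual = (["chain", inner, "yellow"], 3)

# What one might naively expect: four flat terms, all required.
flat = (["chain", "8", "mm", "yellow"], 4)

print(matches(actual, doc1), matches(actual, doc2))  # True True
print(matches(flat, doc1), matches(flat, doc2))      # True False
```

In this model, doc 2 satisfies the inner clause through `mm` alone, so all three top-level clauses match and the 100% requirement is met.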

So, long story short: is this a bug or a feature? At first sight it feels like a bug to me, but I guess it makes sense in another setup?

This is on 7.17.18

Thanks for reading through here and have a nice weekend!

--Alex

Hey Alex!

This seems to be a bug with simple_query_string. If I switch to query_string or disable the WHITESPACE operator (by leaving it out of flags), I no longer see this problem, e.g.:

GET test/_validate/query?explain=true
{
  "query": {
    "simple_query_string": {
      "default_operator": "OR",
      "fields": [
        "title"
      ],
      "minimum_should_match": "100%",
      "query": "chain 8mm yellow",
      "flags": "OR|AND|PREFIX"
    }
  }
}

I found the root cause of this problem.

When the WHITESPACE operator is configured, simple_query_string first splits on whitespace, then creates a query for each split, then puts all these clauses in a boolean query. Because 8mm has no whitespace in it, it's parsed as a single clause within this top-level boolean query.

Then minimum_should_match is applied only on the top-level boolean query, so it only sees 3 clauses.
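
The two steps above can be sketched roughly as follows. This is a simplified illustration in Python, not the actual simple_query_string parser: `analyze`, `build_query`, and `required_matches` are hypothetical helpers, and the regex is only a crude stand-in for the lowercase and word_delimiter filters.

```python
import re

def analyze(text):
    # Crude stand-in for the analyzer (assumption): lowercase, then split
    # into runs of letters and runs of digits, so "8mm" becomes ["8", "mm"].
    return re.findall(r"[a-z]+|[0-9]+", text.lower())

def build_query(query_text):
    """Split on whitespace first, then analyze each chunk into one clause."""
    clauses = []
    for chunk in query_text.split():
        tokens = analyze(chunk)
        if not tokens:
            continue
        # A chunk that analyzes to several tokens stays a single nested clause.
        clauses.append(tokens[0] if len(tokens) == 1 else tokens)
    return clauses

def required_matches(clauses, percent):
    # The percentage is applied to the number of top-level clauses only.
    return len(clauses) * percent // 100

clauses = build_query("chain 8mm yellow")
print(clauses)                         # ['chain', ['8', 'mm'], 'yellow']
print(required_matches(clauses, 100))  # 3
```

Because the nested `['8', 'mm']` clause counts once, 100% resolves to 3 required clauses rather than 4.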

I'm not sure if it's a bug or a feature. The WHITESPACE flag means that each split is a new clause, so it sort-of makes sense to treat the query produced by 8mm as a single clause within the top-level boolean query.