Filtering with minimum_should_match: "100%" not using all tokens in field


(Vaughn Dickson) #1

Hi,

We have a sourceUrl compound field with a custom analyzer on the sourceUrl.pathName sub-field. The idea is that we can run a query to get all URLs under a path, e.g. example.com/path/here brings back everything matching example.com/path/here.*.

This all worked in ES 1.7, but I can't get it working in 2.4. It's almost like 2.4 is only matching on one or two tokens instead of all of them as I'd expect minimum_should_match: "100%" to do.

So even if I use a very specific filter like "http://www.animalfactguide.com/category/animal-news/page/6/" I get all the other pages at "http://www.animalfactguide.com/category/animal-news/page/10/" etc.

Query

{
  "query": {
    "bool": {
      "filter": {
        "match": {
          "sourceUrl.pathName": {
            "query": "http://www.animalfactguide.com/category/animal-news/page/6/",
            "type": "boolean",
            "minimum_should_match": "100%"
          }
        }
      }
    }
  }
}

Mappings and analyzers:

{
  "settings": {
    "index": {
      "analysis": {
        "char_filter": {
          "drop_trailing_slash": {
            "type": "pattern_replace",
            "pattern": "/$",
            "replacement": ""
          },
          "path": {
            "type": "pattern_replace",
            "pattern": "^(.*://)?([^/]*)((/[^?]*?)?(/([^/]*.html?)?)?)(\\?.*)?$",
            "replacement": "$3"
          },
          "drop_leading_slash": {
            "type": "pattern_replace",
            "pattern": "^/",
            "replacement": ""
          }
        },
        "analyzer": {
          "pathName": {
            "type": "custom",
            "char_filter": [
              "path",
              "drop_leading_slash",
              "drop_trailing_slash"
            ],
            "tokenizer": "pathName",
            "filter": "lowercase"
          }
        },
        "tokenizer": {
          "pathName": {
            "type": "path_hierarchy",
            "reverse": "false",
            "delimiter": "/"
          }
        }
      }
    }
  },
  "mappings": {
    "page": {
      "properties": {
        "sourceUrl": {
          "index": "no",
          "type": "string",
          "fields": {
            "pathName": {
              "analyzer": "pathName",
              "type": "string"
            },
            "raw": {
              "index": "not_analyzed",
              "type": "string"
            }
          }
        }
      }
    }
  }
}
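
For illustration, here is a rough Python emulation of what the pathName analyzer above is expected to produce for a URL. It is only a sketch: the function names are made up for this example, Python's regex engine stands in for Java's, and the path_hierarchy step is approximated by emitting one token per path prefix.

```python
import re

# The "path" char filter pattern from the settings, as a Python regex.
PATH = re.compile(r"^(.*://)?([^/]*)((/[^?]*?)?(/([^/]*.html?)?)?)(\?.*)?$")


def apply_char_filters(url):
    """Approximate the three pattern_replace char filters, in mapping order."""
    text = PATH.sub(lambda m: m.group(3) or "", url)  # "path": keep group 3
    text = re.sub(r"^/", "", text)                    # "drop_leading_slash"
    text = re.sub(r"/$", "", text)                    # "drop_trailing_slash"
    return text


def path_hierarchy(text, delimiter="/"):
    """Approximate path_hierarchy: one token per prefix of the path."""
    if not text:
        return []
    parts = text.split(delimiter)
    return [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]


def analyze(url):
    """Char filters, then tokenizer, then the lowercase token filter."""
    return [token.lower() for token in path_hierarchy(apply_char_filters(url))]


if __name__ == "__main__":
    for token in analyze("http://www.animalfactguide.com/category/animal-news/page/6/"):
        print(token)
```

Under these assumptions the page/6 URL yields four tokens (category, category/animal-news, category/animal-news/page, category/animal-news/page/6), and the page/10 URL shares the first three of them.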

Any help and pointers would be appreciated. I'm running out of options :confused:


(Vaughn Dickson) #2

I narrowed the problem down a bit further with some debugging of ES source.

It looks like coord_disabled is set to true by Lucene, which stops minimumShouldMatch from being applied.

I have yet to understand why.


(Vaughn Dickson) #3

OK. It looks like I've hit a bug in ES. We're using a match query with boolean type, which goes down the code path where Lucene's setDisableCoord(true) is called, and then minimumShouldMatch is ignored.
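
To make the symptom concrete, here is a simplified Python model of the matching. It is not the Lucene code: the function names are hypothetical, only positive percentages are handled, and the token lists assume the pathName analyzer emits one token per path prefix.

```python
def required_matches(num_clauses, minimum_should_match=None):
    """How many optional clauses must match (simplified model)."""
    if minimum_should_match is None:
        return 1  # with no minimum, any single matching token is enough
    percent = int(minimum_should_match.rstrip("%"))
    return max(1, (num_clauses * percent) // 100)  # percentage, rounded down


def matches(query_tokens, doc_tokens, minimum_should_match=None):
    """True if enough query tokens are present in the document's tokens."""
    unique_query = set(query_tokens)
    hits = len(unique_query & set(doc_tokens))
    return hits >= required_matches(len(unique_query), minimum_should_match)


# Assumed analyzer output for the two URLs discussed in this thread.
PAGE_6 = [
    "category",
    "category/animal-news",
    "category/animal-news/page",
    "category/animal-news/page/6",
]
PAGE_10 = PAGE_6[:3] + ["category/animal-news/page/10"]

if __name__ == "__main__":
    print(matches(PAGE_6, PAGE_10, "100%"))  # expected behaviour
    print(matches(PAGE_6, PAGE_10, None))    # minimum_should_match ignored
```

With "100%" honoured, the page/6 query does not match the page/10 document, because only three of its four tokens are present. With the setting ignored, a single shared token is enough, which is consistent with the results I'm seeing.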


(Vaughn Dickson) #4

I'm forced to use cutoff_frequency to go down the createCommonTermsQuery path and get the correct behaviour. I'm not sure whether this is a bug in ES or in Lucene, but setDisableCoord(true) has been removed in Lucene 6.x, so it should eventually be fixed in a newer ES.

{
  "query": {
    "bool" : {
      "filter" : {
        "match" : {
          "sourceUrl.pathName" : {
            "query" : "http://www.animalfactguide.com/category/animal-news/page/",
            "type" : "boolean",
            "minimum_should_match" : "100%",
            "cutoff_frequency" : 0.99
          }
        }
      }
    }
  }
}
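
If the query body is built in client code, the workaround can be kept behind a flag so it is easy to drop later. A minimal sketch in Python, assuming a hypothetical helper name (path_filter_query) and using only the fields shown in the queries above:

```python
import json


def path_filter_query(url, cutoff_frequency=None):
    """Build the bool/filter/match body used in this thread.

    Passing cutoff_frequency adds the workaround parameter; leaving it
    as None produces the original query.
    """
    match_body = {
        "query": url,
        "type": "boolean",
        "minimum_should_match": "100%",
    }
    if cutoff_frequency is not None:
        match_body["cutoff_frequency"] = cutoff_frequency
    return {
        "query": {
            "bool": {
                "filter": {
                    "match": {"sourceUrl.pathName": match_body}
                }
            }
        }
    }


if __name__ == "__main__":
    body = path_filter_query(
        "http://www.animalfactguide.com/category/animal-news/page/",
        cutoff_frequency=0.99,
    )
    print(json.dumps(body, indent=2))
```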

(Vaughn Dickson) #5

Filed a bug report: https://github.com/elastic/elasticsearch/issues/20581

