ES 5.0-rc1 resolving should/minimum_should_match speed problem


(Alex Alexandre) #1

Hi,
I am not sure whether this is intended, but when I use bool/should queries, the query gets really slow.
For my tests, I built an index of about 4,170,000 documents stored in 6 different types.
The full query:

{
"from": 0,
"size": 1,
"sort": [{
"_geo_distance": {
"geoloc": {
"lat": 39.9173,
"lon": 116.386
},
"order": "asc",
"unit": "km"
}
}],
"stored_fields": ["_type", "_id"],
"query": {
"bool": {
"must": [{
"match_all": {}
}, {
"geo_distance": {
"distance": "28km",
"geoloc": {
"lat": 39.9173,
"lon": 116.386
}
}
}],
"should": [
[{
"term": {
"_type": "1"
}
}, {
"geo_distance": {
"distance": "4km",
"geoloc": {
"lat": 39.9173,
"lon": 116.386
}
}
}],
[{
"term": {
"_type": "2"
}
}, {
"geo_distance": {
"distance": "4km",
"geoloc": {
"lat": 39.9173,
"lon": 116.386
}
}
}],
[{
"term": {
"_type": "3"
}
}, {
"geo_distance": {
"distance": "15km",
"geoloc": {
"lat": 39.9173,
"lon": 116.386
}
}
}],
[{
"term": {
"_type": "4"
}
}, {
"geo_distance": {
"distance": "28km",
"geoloc": {
"lat": 39.9173,
"lon": 116.386
}
}
}],
[{
"term": {
"_type": "5"
}
}, {
"geo_distance": {
"distance": "4km",
"geoloc": {
"lat": 39.9173,
"lon": 116.386
}
}
}],
[{
"term": {
"_type": "6"
}
}, {
"geo_distance": {
"distance": "28km",
"geoloc": {
"lat": 39.9173,
"lon": 116.386
}
}
}]
],
"minimum_should_match": 1,
"filter": []
}
}
}

The behavior changes drastically depending on whether or not I include the part of the query in bold (the 28km geo_distance clause inside "must").

Without that clause I get:

{
"took" : 277,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 4170789,
"max_score" : null,
"hits" : [
{
"_index" : "dev_geopoints",
"_type" : "4",
"_id" : "288466",
"_score" : null,
"sort" : [
13.378053070482325
]
}
]
}
}

But when I add to my "must" clause a filter on the maximum radius used in the "should" clauses, I get:

{
"took" : 9,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 8,
"max_score" : null,
"hits" : [
{
"_index" : "dev_geopoints",
"_type" : "4",
"_id" : "288466",
"_score" : null,
"sort" : [
13.378053070482325
]
}
]
}
}

In ES 2.x I could use "and"/"or" filters in the filter section, but in ES 5.0 I cannot use "or" filters and have to use "should" instead. That slows the queries down a lot.
Am I missing something?

Thanks,
Alex


(Zachary Tong) #2

The main reason it's slow is as you noted: without a restrictive must clause, there are many more matching documents (4,170,789 hits instead of 8 in your two responses), and the should clauses have to be checked against all of them. With the addition of the geo_distance must clause, you make it a very exclusive query and immediately reduce the number of matching docs, which means the should combinations are quick to check.

You wrote: in ES 5.0 i cannot use "or" filters and i have to use "should".

The arrangement is still possible using filters; it just requires a slightly different layout. You have to nest whatever you want filtered under a bool's filter clause, or under a constant_score. For example, if you want all of those to run as a filter:

{
  "from":0,
  "size":1,
  "sort":[
    {
      "_geo_distance":{
        "geoloc":{
          "lat":39.9173,
          "lon":116.386
        },
        "order":"asc",
        "unit":"km"
      }
    }
  ],
  "stored_fields":[
    "_type",
    "_id"
  ],
  "query":{
    "bool":{
      "filter":[
        {
          "geo_distance":{
            "distance":"28km",
            "geoloc":{
              "lat":39.9173,
              "lon":116.386
            }
          }
        },
        {
          "bool":{
            "should":[
              {
                "term":{
                  "_type":"1"
                }
              },
              {
                "geo_distance":{
                  "distance":"4km",
                  "geoloc":{
                    "lat":39.9173,
                    "lon":116.386
                  }
                }
              },
              {
                "term":{
                  "_type":"2"
                }
              },
              {
                "geo_distance":{
                  "distance":"4km",
                  "geoloc":{
                    "lat":39.9173,
                    "lon":116.386
                  }
                }
              },
              {
                "term":{
                  "_type":"3"
                }
              },
              {
                "geo_distance":{
                  "distance":"15km",
                  "geoloc":{
                    "lat":39.9173,
                    "lon":116.386
                  }
                }
              },
              {
                "term":{
                  "_type":"4"
                }
              },
              {
                "geo_distance":{
                  "distance":"28km",
                  "geoloc":{
                    "lat":39.9173,
                    "lon":116.386
                  }
                }
              },
              {
                "term":{
                  "_type":"5"
                }
              },
              {
                "geo_distance":{
                  "distance":"4km",
                  "geoloc":{
                    "lat":39.9173,
                    "lon":116.386
                  }
                }
              },
              {
                "term":{
                  "_type":"6"
                }
              },
              {
                "geo_distance":{
                  "distance":"28km",
                  "geoloc":{
                    "lat":39.9173,
                    "lon":116.386
                  }
                }
              }
            ]
          }
        }
      ],
      "minimum_should_match":1
    }
  }
}

You can see in that example that the "must" clauses go directly into the filter of a boolean. All clauses in a filter are required just like a must but aren't scored. Then all the optional clauses go into a bool with a should. Because they are nested under the first bool's filter, they are executed as non-scoring filters, but are optional due to being in the second bool's should.

So you can read that query as: "MUST be within 28km of first geopoint and MUST be ((within 4km) OR (within 4km) OR (within 15km) OR ... )"
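To make that reading concrete, here is a small toy model of the boolean logic in plain Python. It is only a sketch and not Elasticsearch code: a document is reduced to a type and a distance in km from the search point, the function name `matches_flat` is made up for illustration, and treating the radius as inclusive is an assumption.

```python
# Toy model of the flattened query's boolean logic; not Elasticsearch code.
# A document is reduced to (doc_type, distance_km from the search point).

# The should clauses of the flattened query, each an independent condition.
TYPE_CLAUSES = ["1", "2", "3", "4", "5", "6"]
RADIUS_CLAUSES_KM = [4, 4, 15, 28, 4, 28]


def matches_flat(doc_type: str, distance_km: float) -> bool:
    """Filter: within 28km AND at least one should clause matches."""
    within_outer = distance_km <= 28  # assumption: radius is inclusive
    any_should = (
        doc_type in TYPE_CLAUSES
        or any(distance_km <= r for r in RADIUS_CLAUSES_KM)
    )
    return within_outer and any_should


print(matches_flat("1", 20.0))  # the term clause for type 1 matches
print(matches_flat("9", 20.0))  # the 28km should clause alone matches
print(matches_flat("1", 30.0))  # rejected by the outer 28km filter
```

In this model each term clause and each geo_distance clause counts on its own, so a type and a radius are never tied together: any document within 28km passes, whatever its type.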

Also note, the existing syntax from your query is not valid:

"should":[
        [
          {
            "term":{
              "_type":"1"
            }
          },
          {
            "geo_distance":{
              "distance":"4km",
              "geoloc":{
                "lat":39.9173,
                "lon":116.386
              }
            }
          }
        ],

i.e. the double-nested arrays. The should, must, etc. clauses only accept arrays of objects, not arrays of arrays of objects. I'm not sure what you're trying to express here?


(Alex Alexandre) #3

OK, thanks a lot. I'll try to optimise it that way :)


(Alex Alexandre) #4

Thanks to you I solved the problem. What I wanted to express was this:
"should":[
{
"bool": {
"must": [{
"term": {
"_type": "1"
}
}, {
"geo_distance": {
"distance": "4km",
"geoloc": {
"lat": 39.9173,
"lon": 116.386
}
}
}]
}
}, ...

... my bad
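Putting the two posts together, the complete query might look roughly like the sketch below. It is written in Python only to avoid repeating six near-identical clauses. The helper names (`geo_distance`, `type_within`, `build_query`) are made up for illustration, the per-type radii are copied from the original query, and placing "minimum_should_match" inside the inner bool is an assumption about the intent rather than something confirmed in the thread.

```python
import json

# Search point and per-type radii, copied from the original query.
CENTER = {"lat": 39.9173, "lon": 116.386}
RADIUS_BY_TYPE = {
    "1": "4km", "2": "4km", "3": "15km",
    "4": "28km", "5": "4km", "6": "28km",
}


def geo_distance(distance):
    """One geo_distance clause around CENTER on the geoloc field."""
    return {"geo_distance": {"distance": distance, "geoloc": CENTER}}


def type_within(doc_type, distance):
    """(type == doc_type) AND (within distance), as a nested bool/must."""
    return {"bool": {"must": [{"term": {"_type": doc_type}},
                              geo_distance(distance)]}}


def build_query():
    """Outer 28km filter AND (any of the per-type type+radius pairs)."""
    per_type = [type_within(t, r) for t, r in RADIUS_BY_TYPE.items()]
    return {
        "from": 0,
        "size": 1,
        "sort": [{"_geo_distance": {"geoloc": CENTER,
                                    "order": "asc", "unit": "km"}}],
        "stored_fields": ["_type", "_id"],
        "query": {"bool": {"filter": [
            geo_distance("28km"),
            # Assumption: at least one type+radius pair has to match.
            {"bool": {"should": per_type, "minimum_should_match": 1}},
        ]}},
    }


print(json.dumps(build_query(), indent=2))
```

Because everything sits under the outer bool's filter clause, none of these clauses contribute to scoring, which fits a query that is sorted by distance anyway.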


(Zachary Tong) #5

No worries! The rightward drift of the DSL can be tricky at times, and it takes a while to get used to nesting clauses under the appropriate queries to express the boolean logic you want.

Good luck, and feel free to ping with more questions :)

