Performance difference using bool with filter

milodky · November 5, 2015, 7:34pm

Hi,

I'm currently testing the performance of Elasticsearch 2.0.

My original query is:

  "query":{
    "filtered":{
      "filter":{
        "term":{
          "last_name":"brown"
        }
      }
    }
  }

The took value is roughly 260ms.

However, when I switched to bool with filter as follows (since filtered is deprecated):

  "query":{
    "bool":{
      "filter":[
        {
          "term":{
            "last_name":"brown"
          }
        }
      ]
    }
  }

I expected the two took values to be similar, but this time the took value rose to around 300ms.

The query matches 5.9 million records, and I don't care about the score at all for now. Both tests were run on ES 2.0 on the same cluster.

Is there any better query?

Thanks,

Xiaoting

polyfractal · November 5, 2015, 10:24pm

Hm, it should be the same or similar ... under the covers they are executing almost exactly the same query.

Did you run the test multiple times for both queries? I'm not saying there isn't an effect there, but the time difference is small enough it could simply be random chance (e.g. noise) and not a statistically significant effect.

Ideally:

1. Run the filtered query a few dozen times to warm up the OS cache and JVM
2. Run it another 20 times and record all values
3. Shut the server down. Restart
4. Run bool a few dozen times to warm up the OS cache and JVM
5. Run it another 20 times and record all values

Then compare the values (preferably with a t-test) to see if there is a real difference. It's not fun, but benchmarking is so easily skewed by the smallest things that you really have to take a statistical population approach.
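The warm-up-then-measure procedure above could be sketched roughly as follows. This is only an illustration: `benchmark`, `run_query`, and `fake_query` are hypothetical names, and the warm-up count of 36 is an arbitrary stand-in for "a few dozen". A real `run_query` would send the search request to the cluster and return the `took` value from the response.

```python
import statistics
from typing import Callable, List


def benchmark(run_query: Callable[[], float], warmup: int = 36, samples: int = 20) -> List[float]:
    """Discard `warmup` runs, then record `samples` timings."""
    for _ in range(warmup):
        run_query()  # warm the OS cache and JVM; these results are thrown away
    return [run_query() for _ in range(samples)]


def fake_query() -> float:
    # Hypothetical stand-in for a real search call. A real version would
    # send the query and return the response's "took" value in milliseconds.
    return 300.0


timings = benchmark(fake_query)
print(len(timings), statistics.mean(timings), statistics.median(timings))
```

Running the same function once per query type, with a restart in between, yields the two samples to compare.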

With that said, I checked what Lucene queries are being run. The filtered query is doing this in 2.0:

GET /test/_validate/query?explain=true
{
   "query": {
      "filtered": {
         "filter": {
            "term": {
               "last_name": "brown"
            }
         }
      }
   }
}


{
   "valid": true,
   "_shards": {...},
   "explanations": [
      {
         "index": "test",
         "valid": true,
         "explanation": "+*:* #last_name:brown"
      }
   ]
}

Whereas the bool query is doing this:

GET /test/_validate/query?explain=true
{
   "query": {
      "bool": {
         "filter": [
            {
               "term": {
                  "last_name": "brown"
               }
            }
         ]
      }
   }
}

{
   "valid": true,
   "_shards": {...},
   "explanations": [
      {
         "index": "test",
         "valid": true,
         "explanation": "#last_name:brown"
      }
   ]
}

This is slightly different (the filtered query includes a match_all query). If you want, you could try adding a match_all to the bool query and see if it improves. I'd be surprised if it does, but perhaps there is an optimization that's not being used otherwise?

GET /test/_validate/query?explain=true
{
   "query": {
      "bool": {
         "must": [
            {
               "match_all": {}
            }
         ],
         "filter": [
            {
               "term": {
                  "last_name": "brown"
               }
            }
         ]
      }
   }
}

However, I'll stress my first point: it's likely just noise. Make sure you run these a number of times to even out noise, try to isolate any other processes from touching the cluster, etc.

polyfractal · November 5, 2015, 10:26pm

Also /cc @jpountz in case he's interested / has input.

milodky · November 6, 2015, 11:13pm

Yes. I ran each at least 20 times and saw the same results as before.

polyfractal · November 7, 2015, 2:55pm

Interesting. Did you try adding the must to the bool to see if it gives the same performance boost? Do both queries return the same number of documents? How many documents are in your index (and how many are being returned)?

If I get some time this weekend I'll see if I can work up a similar benchmark locally to reproduce.

jpountz · November 9, 2015, 4:35pm

Both queries should indeed behave roughly the same. As Zach mentioned, it's not entirely clear whether the difference is significant here.

As a side note, if the query that you want to express is "find everything that matches this term regardless of scoring", then it's more appropriate to build a constant_score query than a bool query with a single filter clause or even a filtered query that only has a filter. Not that it should change performance (bool queries with a single filter clause should rewrite to a constant-score query), but it should make your queries a bit more readable.
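For illustration, the constant_score form of the query from this thread might look like the sketch below, written as a Python dict that serializes to the request body. The field and value come from the original post; the variable name `body` is just for the example.

```python
import json

# constant_score wraps a filter and assigns every matching document
# the same score, which states the "no scoring needed" intent directly.
body = {
    "query": {
        "constant_score": {
            "filter": {
                "term": {"last_name": "brown"}
            }
        }
    }
}

# Print the JSON that would be sent as the search request body.
print(json.dumps(body, indent=2))
```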

milodky · November 9, 2015, 6:00pm

I ran the must query 10 times and the took values (ms) were: [340, 339, 341, 338, 335, 335, 336, 433, 345, 340]
For filter: [337, 405, 340, 338, 336, 342, 336, 334, 337, 346]
For filtered: [305, 373, 308, 309, 308, 309, 305, 307, 311, 306]

The queries are listed below:

curl -X GET 'http://localhost:9200/index6/_search?fields=_id&from=0&size=100&pretty' -d '{
  "query":{
    "bool":{
      "must":[
        {
          "term":{
            "last_name":"brown"
          }
        }
      ]
    }
  }
}
'
curl -X GET 'http://localhost:9200/index6/_search?fields=_id&from=0&size=100&pretty' -d '{
  "query":{
    "bool":{
      "filter":[
        {
          "term":{
            "last_name":"brown"
          }
        }
      ]
    }
  }
}
'
curl -X GET 'http://localhost:9200/index6/_search?fields=_id&from=0&size=100&pretty' -d '{
  "query":{
    "filtered":{
      "filter":{
        "term":{
          "last_name":"brown"
        }
      }
    }
  }
}
'

The scores from each hit look like:

must: 3.8744829
filter: 0.0
filtered: 1.0

I have 10 indices. My query only hits the 6th index. Each index only has one shard. That particular index has 15698768 docs. I'm using two EC2 m3.xlarge boxes as the data nodes and one m3.large as the master.

polyfractal · November 9, 2015, 6:10pm

Yeah, that's definitely a statistically significant result (p=0.0041). I'm afraid this is now beyond my domain knowledge, but hopefully @jpountz will have an idea.
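As a rough check on that significance claim, here is a minimal sketch of Welch's t-test using only the Python standard library. The post does not say which pair of samples was compared, so this assumes filter versus filtered, using the took values reported above; `welch_t` is a hypothetical helper name.

```python
import math
import statistics

# Took values (ms) reported earlier in the thread.
filter_ms = [337, 405, 340, 338, 336, 342, 336, 334, 337, 346]
filtered_ms = [305, 373, 308, 309, 308, 309, 305, 307, 311, 306]


def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    va = statistics.variance(a) / len(a)  # squared standard error of mean(a)
    vb = statistics.variance(b) / len(b)  # squared standard error of mean(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


t_stat, dof = welch_t(filter_ms, filtered_ms)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

Turning the statistic into a p-value needs the t-distribution CDF, which the standard library lacks; `scipy.stats.ttest_ind(filter_ms, filtered_ms, equal_var=False)` is one way to get it if SciPy is available.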

Just to satisfy my curiosity, could you also try running this to see if it has an effect:

curl -X GET 'http://localhost:9200/index6/_search?fields=_id&from=0&size=100&pretty' -d '{
   "query": {
      "bool": {
         "must": [
            {
               "match_all": {}
            }
         ],
         "filter": [
            {
               "term": {
                  "last_name": "brown"
               }
            }
         ]
      }
   }
}
'

milodky · November 9, 2015, 6:18pm

Sure.

curl -X GET 'http://localhost:9200/index6/_search?fields=_id&from=0&size=100&pretty' -d '{
  "query":{
    "bool":{
      "must":[
        {
          "match_all":{}
        }
      ],
      "filter":[
        {
          "term":{
            "last_name":"brown"
          }
        }
      ]
    }
  }
}
'

Took: [608, 472, 480, 642, 471, 482, 478, 472, 475, 473]
However, this time the score becomes 1.0.

milodky · November 9, 2015, 6:30pm

By the way, the last_name field is not_analyzed.

jpountz · November 19, 2015, 1:50pm

For the record, there are plans to optimize such queries that only have a single match_all scoring clause via https://issues.apache.org/jira/browse/LUCENE-6889
