Performance difference using bool with filter


(milodky) #1

Hi,

I'm currently testing the performance of Elasticsearch 2.0.

My original query is:

  "query":{
    "filtered":{
      "filter":{
        "term":{
          "last_name":"brown"
        }
      }
    }
  }

The took value is roughly 260ms.

However, when I switched to bool with filter as follows (since filtered is deprecated):

  "query":{
    "bool":{
      "filter":[
        {
          "term":{
            "last_name":"brown"
          }
        }
      ]
    }
  }

I expected the two took values to be similar, but this time the took value rose to around 300ms.

The query matches 5.9 million records, and I don't care about the score at all for now. Both tests were run on ES 2.0 on the same cluster.

Is there any better query?

Thanks,

Xiaoting


(Zachary Tong) #2

Hm, they should be the same or similar ... under the covers they execute almost exactly the same query.

Did you run the test multiple times for both queries? I'm not saying there isn't an effect there, but the time difference is small enough that it could simply be random chance (i.e. noise) and not a statistically significant effect.

Ideally:

  1. Run the filtered query a few dozen times to warm up the OS cache and JVM
  2. Run it another 20 times and record all values
  3. Shut the server down and restart it
  4. Run the bool query a few dozen times to warm up the OS cache and JVM
  5. Run it another 20 times and record all values

Then compare the values (preferably with a t-test) to see if there is a real difference. It's not fun, but benchmarking is so easily skewed by the smallest things that you really have to take a statistical population approach.

With that said, I checked what Lucene queries are being run. The filtered query is doing this in 2.0:
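The warm-up and measurement steps above can be sketched as a small harness. This is only a sketch: `collect_took` and its `run_query` argument are hypothetical names, and `run_query` is assumed to execute one search and return the `took` value in milliseconds. The server restart between the two query types still has to be done by hand.

```python
from typing import Callable, List


def collect_took(run_query: Callable[[], int],
                 warmup: int = 30,
                 samples: int = 20) -> List[int]:
    """Warm up, then record `took` values for one query type.

    `run_query` is a hypothetical callable that runs a single search
    and returns the `took` value (milliseconds) from the response.
    """
    # Warm-up runs: results are discarded, they only prime the
    # OS cache and the JVM.
    for _ in range(warmup):
        run_query()
    # Measured runs: keep every value so the two populations can be
    # compared afterwards.
    return [run_query() for _ in range(samples)]


if __name__ == "__main__":
    # Stand-in for a real search call, used here only so the sketch runs.
    fake_values = iter(range(1000))
    took = collect_took(lambda: next(fake_values), warmup=30, samples=20)
    print(len(took), "measured values")
```

Passing the query as a callable keeps the protocol (warm up, then measure) separate from how the request is actually sent.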

GET /test/_validate/query?explain=true
{
   "query": {
      "filtered": {
         "filter": {
            "term": {
               "last_name": "brown"
            }
         }
      }
   }
}


{
   "valid": true,
   "_shards": {...},
   "explanations": [
      {
         "index": "test",
         "valid": true,
         "explanation": "+*:* #last_name:brown"
      }
   ]
}

Whereas the bool query is doing this:

GET /test/_validate/query?explain=true
{
   "query": {
      "bool": {
         "filter": [
            {
               "term": {
                  "last_name": "brown"
               }
            }
         ]
      }
   }
}

{
   "valid": true,
   "_shards": {...},
   "explanations": [
      {
         "index": "test",
         "valid": true,
         "explanation": "#last_name:brown"
      }
   ]
}

This is slightly different (the filtered query includes a match_all query). If you like, you could try adding a match_all to the bool and see if it improves. I'd be surprised if it does, but perhaps there is an optimization that's not being used otherwise?

GET /test/_validate/query?explain=true
{
   "query": {
      "bool": {
         "must": [
            {
               "match_all": {}
            }
         ],
         "filter": [
            {
               "term": {
                  "last_name": "brown"
               }
            }
         ]
      }
   }
}

However, I'll stress my first point: it's likely just noise. Make sure you run these a number of times to even out noise, try to keep other processes from touching the cluster, etc.


(Zachary Tong) #3

Also /cc @jpountz in case he's interested / has input. :yum:


(milodky) #4

Yes. I ran each at least 20 times and the results appeared the same.


(Zachary Tong) #5

Interesting. Did you try adding the must to the bool to see if it gives the same performance boost? Do both queries return the same number of documents? How many documents are in your index (and how many are being returned)?

If I get some time this weekend I'll see if I can work up a similar benchmark locally to reproduce.


(Adrien Grand) #6

Both queries should indeed behave roughly the same. As Zach mentioned, it's not entirely clear whether the difference is significant here.

As a side note, if the query that you want to express is "find everything that matches this term regardless of scoring", then it's more appropriate to build a constant_score query than a bool query with a single filter clause, or even a filtered query that only has a filter. Not that it should change performance (bool queries with a single filter clause should rewrite to a constant-score query), but it should make your queries a bit more readable.
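For illustration, a constant_score version of the term lookup from this thread might be built like this. It is a sketch of the request body only; the field and value are taken from the original post, and no claim is made about its timing.

```python
import json

# Query body that wraps the term filter from the original post in
# constant_score, so every matching document gets the same score.
constant_score_query = {
    "query": {
        "constant_score": {
            "filter": {
                "term": {"last_name": "brown"}
            }
        }
    }
}

# Serialize to the JSON that would be sent as the search request body.
body = json.dumps(constant_score_query, indent=2)
print(body)
```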


(milodky) #7

I ran must 10 times and the results are [340, 339, 341, 338, 335, 335, 336, 433, 345, 340]
For filter: [337, 405, 340, 338, 336, 342, 336, 334, 337, 346]
For filtered: [305, 373, 308, 309, 308, 309, 305, 307, 311, 306]

The queries are listed below:

curl -X GET 'http://localhost:9200/index6/_search?fields=_id&from=0&size=100&pretty' -d '{
  "query":{
    "bool":{
      "must":[
        {
          "term":{
            "last_name":"brown"
          }
        }
      ]
    }
  }
}
'
curl -X GET 'http://localhost:9200/index6/_search?fields=_id&from=0&size=100&pretty' -d '{
  "query":{
    "bool":{
      "filter":[
        {
          "term":{
            "last_name":"brown"
          }
        }
      ]
    }
  }
}
'
curl -X GET 'http://localhost:9200/index6/_search?fields=_id&from=0&size=100&pretty' -d '{
  "query":{
    "filtered":{
      "filter":{
        "term":{
          "last_name":"brown"
        }
      }
    }
  }
}
'

The scores from each hit look like:

must: 3.8744829
filter: 0.0
filtered: 1.0

I have 10 indices. My query only hits the 6th index. Each index has only one shard. That particular index has 15698768 docs. I'm using two EC2 m3.xlarge boxes as the data nodes and one m3.large as the master.


(Zachary Tong) #8

Yeah, that's definitely a statistically significant result (p=0.0041). I'm afraid this is now beyond my domain knowledge, but hopefully @jpountz will have an idea :smile:
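A p-value in this range can be reproduced from the timings posted above, assuming the comparison is a two-sided Welch's t-test between the filter and filtered samples (the post does not say which test or which pair of samples was used). The sketch below uses only the standard library and integrates the t-distribution density numerically; the function names are illustrative.

```python
import math
from statistics import mean, variance

# `took` values (ms) posted earlier in the thread.
FILTER = [337, 405, 340, 338, 336, 342, 336, 334, 337, 346]
FILTERED = [305, 373, 308, 309, 308, 309, 305, 307, 311, 306]


def welch_t(a, b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def t_pdf(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=10_000):
    """Two-sided p-value via Simpson's rule on the density over [0, |t|]."""
    upper = abs(t)
    h = upper / steps  # `steps` must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)


t, df = welch_t(FILTER, FILTERED)
p = two_sided_p(t, df)
print(f"t = {t:.2f}, df = {df:.1f}, p = {p:.4f}")
```

Welch's variant is used here because it does not assume equal variances in the two samples; with a library such as SciPy available, a ready-made t-test would be the more usual choice.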

Just to satisfy my curiosity, could you also try running this to see if it has an effect:

curl -X GET 'http://localhost:9200/index6/_search?fields=_id&from=0&size=100&pretty' -d '{
   "query": {
      "bool": {
         "must": [
            {
               "match_all": {}
            }
         ],
         "filter": [
            {
               "term": {
                  "last_name": "brown"
               }
            }
         ]
      }
   }
}
'

(milodky) #9

Sure.

curl -X GET 'http://localhost:9200/index6/_search?fields=_id&from=0&size=100&pretty' -d '{
  "query":{
    "bool":{
      "must":[
        {
          "match_all":{}
        }
      ],
      "filter":[
        {
          "term":{
            "last_name":"brown"
          }
        }
      ]
    }
  }
}
'

Took: [608, 472, 480, 642, 471, 482, 478, 472, 475, 473]
However, this time the score becomes 1.0.
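Putting the four sets of posted took values side by side makes the ordering easier to see. The sketch below uses the median rather than the mean, because each run contains one or two outliers; the labels are just descriptive names for the queries in this thread.

```python
from statistics import median

# `took` values (ms) as posted in this thread.
took = {
    "bool must (term)": [340, 339, 341, 338, 335, 335, 336, 433, 345, 340],
    "bool filter": [337, 405, 340, 338, 336, 342, 336, 334, 337, 346],
    "filtered": [305, 373, 308, 309, 308, 309, 305, 307, 311, 306],
    "bool must match_all + filter": [608, 472, 480, 642, 471, 482, 478, 472, 475, 473],
}

medians = {name: median(values) for name, values in took.items()}

# Print the variants from fastest to slowest median.
for name, value in sorted(medians.items(), key=lambda item: item[1]):
    print(f"{name}: {value} ms")
```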


(milodky) #10

By the way, the last_name field is not_analyzed.


(Adrien Grand) #11

For the record, there are plans to optimize such queries that only have a single match_all scoring clause via https://issues.apache.org/jira/browse/LUCENE-6889

