Performance difference between searching with one field versus two fields

milodky · November 20, 2015, 12:39am

Hi,

I have two types of query, as follows:
1)

{
  "query":{
    "filtered":{
      "filter":{
        "bool":{
          "must":[
            {
              "term":{
                "last_name":"brown"
              }
            }
          ]
        }
      }
    }
  }
}

2)

{
  "query":{
    "filtered":{
      "filter":{
        "bool":{
          "must":[
            {
              "term":{
                "first_name":"john"
              }
            },
            {
              "term":{
                "last_name":"brown"
              }
            }
          ]
        }
      }
    }
  }
}

The first one matches 5950139 entries while the second matches 52612. I'm only retrieving 100 entries. Since the second one has to match two conditions, I expected it to be slower, because it at least needs to do an intersection.

However, the first one takes around 300ms while the second one only takes around 20ms.

So is this expected? If so, what's the logic behind it?

Thanks!

Ivan · November 22, 2015, 5:44pm

Did you warm up the caches before executing the queries? Are those fields
populated for every document and is the cardinality the same? Bitset
intersections are fast in Lucene.

Ivan

jpountz · November 23, 2015, 3:27pm

Searching across different fields is no problem for Elasticsearch, and intersections tend to be very fast. For simple queries (term queries, and combinations of term queries through the bool query), the time it takes to execute a query depends mostly on the number of matches. Since your 2nd query matches fewer documents than the first one, I'm not surprised it runs much faster.
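
To illustrate why an intersection can be cheap, here is a minimal sketch in Python. It assumes each term maps to a sorted list of document ids and intersects two such lists by walking one and skipping ahead in the other. The function name `intersect` and the sample ids are made up for illustration; this is a simplified model, not Lucene's actual implementation.

```python
from bisect import bisect_left

def intersect(a, b):
    """Intersect two sorted doc-id lists, driving from the shorter one."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    result = []
    pos = 0
    for doc in short:
        # Skip ahead in the longer list to the first id >= doc.
        pos = bisect_left(long_, doc, pos)
        if pos == len(long_):
            break  # the longer list is exhausted, nothing more can match
        if long_[pos] == doc:
            result.append(doc)
    return result

# Hypothetical doc ids matching each term.
first_name_john = [2, 5, 9, 14, 21]
last_name_brown = [1, 2, 3, 9, 10, 14, 30]
both = intersect(first_name_john, last_name_brown)
```

In this model the work is bounded by the shorter list times a logarithmic skip, and only the documents in `both` need to be scored and collected afterwards.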

milodky · December 7, 2015, 1:47am

Sorry for my late reply. Yes, I did warm up the caches. First name and last name are required for every document.

milodky · December 7, 2015, 1:58am

Thanks jpountz for your reply!

So here is what I thought, correct me if I'm wrong:

Say searching by first name takes t1 milliseconds and returns n1 records, and searching by last name takes t2 milliseconds and returns n2 records. These two steps can be executed in parallel or sequentially:

1) Parallel: use two threads to search last name and first name at the same time, then intersect the n1 and n2 results.
2) Sequential: search by first name, then filter those n1 results by the matched last name and return what remains.

So the first approach (parallel) will take max(t1, t2) plus the intersection time, while the second takes t1 + t2. The numbers of documents matching john and brown are roughly the same.

That's why I don't quite understand the actual results which contradict my theory...

Thanks,

Xiaoting

jpountz · December 8, 2015, 8:11am

I know it can be confusing. First, threads are not involved: Elasticsearch always runs a query using one thread per shard. The reason why the 2nd query is faster is that it matches fewer documents. When you run a query, Elasticsearch needs to iterate over ALL matches in order to select the top-N (top-100 in your case). Since there are 5950139 docs to examine for the first query and 52612 for the second one, the second query is faster.
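
A rough sketch of that top-N selection, in Python, may make the cost model clearer. It assumes matches arrive as `(score, doc_id)` pairs and keeps the best `n` in a small heap. The names `top_n` and `visited` are hypothetical, and this is only a simplified model of the idea, not Elasticsearch's collector code.

```python
import heapq

def top_n(matches, n):
    """Return the n best (score, doc_id) pairs and how many matches were visited."""
    heap = []      # min-heap holding the current best n candidates
    visited = 0
    for candidate in matches:
        visited += 1  # every match is looked at, even if only n are returned
        if len(heap) < n:
            heapq.heappush(heap, candidate)
        elif candidate > heap[0]:
            # Better than the weakest kept candidate: replace it.
            heapq.heapreplace(heap, candidate)
    return sorted(heap, reverse=True), visited

# Made-up scores for 1000 matching documents.
matches = [(float(i % 7), i) for i in range(1000)]
best, visited = top_n(matches, 3)
```

The heap never grows beyond `n` entries, but `visited` always equals the total number of matches, which is why a query matching 5950139 documents does far more work than one matching 52612, even though both return only 100 hits.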
