Performance difference between searching with one field versus two fields


(milodky) #1

Hi,

I have two types of queries, as follows:
1)

{
  "query":{
    "filtered":{
      "filter":{
        "bool":{
          "must":[
            {
              "term":{
                "last_name":"brown"
              }
            }
          ]
        }
      }
    }
  }
}

2)

{
  "query":{
    "filtered":{
      "filter":{
        "bool":{
          "must":[
            {
              "term":{
                "first_name":"john"
              }
            },
            {
              "term":{
                "last_name":"brown"
              }
            }
          ]
        }
      }
    }
  }
}

The first one matches 5950139 documents while the second matches 52612. I'm only retrieving 100 entries. Since the second one has to match two conditions, I expected it to be slower, because it at least needs to do an intersection.

However, the first one takes around 300ms while the second one takes only around 20ms.

So is this expected? If so, what's the logic behind it?

Thanks!


(Ivan Brusic) #2

Did you warm up the caches before executing the queries? Are those fields
populated for every document and is the cardinality the same? Bitset
intersections are fast in Lucene.

Ivan


(Adrien Grand) #3

Searching across different fields is no problem for Elasticsearch, and intersections tend to be very fast. For simple queries (term queries, and combinations of term queries through the bool query), the time it takes to execute a query depends mostly on the number of matches. Since your 2nd query matches fewer documents than the first one, I'm not surprised it runs much faster.
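
To illustrate why intersections tend to be cheap, here is a minimal sketch of intersecting two sorted lists of document IDs by skipping ahead in whichever list is behind. It is a simplified illustration of the general idea, not Lucene's actual code; the function name `intersect_sorted` and the sample IDs are made up for the example.

```python
from bisect import bisect_left


def intersect_sorted(a, b):
    """Intersect two ascending lists of doc IDs by leapfrogging.

    Hypothetical helper for illustration only: whichever list is
    behind jumps straight to the first ID >= the other list's
    current ID instead of stepping one entry at a time.
    """
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Doc ID present in both lists: it matches both terms.
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Skip ahead in a to the first ID >= b[j].
            i = bisect_left(a, b[j], i + 1)
        else:
            # Skip ahead in b to the first ID >= a[i].
            j = bisect_left(b, a[i], j + 1)
    return out


first_name_john = [1, 3, 5, 7, 9]   # made-up doc IDs
last_name_brown = [3, 4, 5, 10]     # made-up doc IDs
print(intersect_sorted(first_name_john, last_name_brown))
```

Because each step can skip over many non-matching IDs, the work is driven by the overlap between the lists rather than by their combined length.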


(milodky) #4

Sorry for my late reply. Yes, I do warm up the caches. First name and last name are required for every document.


(milodky) #5

Thanks jpountz for your reply!

So here is what I thought, correct me if I'm wrong:

Say searching for the first name takes t1 milliseconds and returns n1 records, and searching for the last name takes t2 milliseconds and returns n2 records. These two steps can be executed in parallel or sequentially:

  1. use two threads to search last name and first name at the same time, then intersect the n1 and n2 results
  2. filter the n1 results by the matching last name and finally return the result.

So the first one (parallel) will take max(t1, t2) plus the intersection time, while the second takes t1 + t2. The numbers of documents matching john and brown are roughly the same.

That's why I don't quite understand the actual results which contradict my theory...

Thanks,

Xiaoting


(Adrien Grand) #6

I know it can be confusing. First, threads are not involved: Elasticsearch always runs a query using one thread per shard. The reason why the second query is faster is that it matches fewer documents. When you run a query, Elasticsearch needs to iterate over ALL matches in order to select the top-N (top-100 in your case). Since there are 5950139 docs to examine for the first query and 52612 for the second one, the second query is faster.
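
The point about iterating over all matches can be sketched as follows. This is a rough model under stated assumptions, not Elasticsearch's real collector: the function `top_n` and the (score, doc ID) tuples are hypothetical. It keeps only the best N hits in a small heap, yet still has to look at every match once.

```python
import heapq


def top_n(matches, n):
    """Return the n best (score, doc_id) pairs and the number examined.

    Hypothetical model of top-N collection: the heap never holds
    more than n entries, but every match is still visited once,
    so the cost grows with the number of matches, not with n.
    """
    heap = []       # min-heap: heap[0] is the weakest hit kept so far
    examined = 0
    for hit in matches:
        examined += 1
        if len(heap) < n:
            heapq.heappush(heap, hit)
        elif hit > heap[0]:
            # Better than the weakest kept hit: replace it.
            heapq.heapreplace(heap, hit)
    return sorted(heap, reverse=True), examined


# Made-up data: 1000 matches, but only the top 3 are requested.
matches = [(i % 7, i) for i in range(1000)]
hits, examined = top_n(matches, 3)
print(len(hits), examined)
```

Requesting 3 hits here still examines all 1000 matches, which mirrors why a query with 5950139 matches costs more than one with 52612 even though both return 100 entries.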

