Search query total hits not adding up when using multiple queries

I have encountered a strange problem whereby the total hits don't seem to be adding up when I use multiple filters.

I have an index with around 40 million documents. A typical document looks like something like this:

{
  "postcodePrefix": "BD1",
  "transactions": [
    {
      "price": 100000,
      "category": "A",
      "date": "2020-01-10"
    },
    {
      "price": 200000,
      "category": "B",
      "date": "2020-01-20"
    }
  ]
}

When I do simple query, for example:

POST /INDEXNAME/_search?size=0
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "postcodePrefix.keyword": "BD1"
          }
        }
      ]
    }
  }
}

The total hits comes back as 2766 (so far so good)

If I then add an additional nested query, for example:

POST /INDEXNAME/_search?size=0
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "transactions",
            "query": {
              "bool": {
                "should": [
                  {
                    "match": {
                      "transactions.category": "A"
                    }
                  }
                ]
              }
            }
          }
        },
        {
          "match": {
            "postcodePrefix.keyword": "BD1"
          }
        }
      ]
    }
  }
}

The total hits now comes back as 2230 (no problem yet)

Last but not least, if i run the nested query again but this time with category "B", the total hits now comes back as 705

(Please note that in the data there are only 2 possible categories, "A" and "B")

What i don't understand is how the following total hits are calculated:

  • No category specified: 2766 hits
  • Category A: 2230 hits
  • Category B: 705 hits

Shouldn't the query where no category is specified return 2230 + 705 = 2935 hits?

Easiest way to solve is run query that excludes documents with ("postcodePrefix.keyword": "BD1") AND ("transactions.category": "A" OR "transactions.category": "B") and see what you get.

@Vinayak_Sapre That would just return the total hits for everything else. Can you expand on your suggestion?

If you have a document with 2 nested objects -one is A and one is B, then that single document is counted in both of the above queries. The sets overlap.

Thanks @Mark_Harwood I think I understand now.

So in my case, on the typical document example above, Elasticsearch see's that as 3 documents and therefore when I query with just a postcodePrefix it returns 1 hit but if I add an additional query with category "A" then it returns 2 hits, the root document + one of the nested documents. Is that right?

Whats important in my case is the transactions count. Is there a way around this so that in this case the count only returns 1 instead of 2? like a query within a query?

To be accurate, elasticsearch sees it as one document that falls into multiple sets.
Let's take this web page as an example document - it might have these properties.

 { "date" : "27/7/20", "contributors" : ["mark", "anton", "vinayak"] ... }

So this document belongs to both your set of topics and mine.

No, I don't think so. When you say "additional query" it depends on if you mean as a Boolean AND or an OR (narrowing or expanding the criteria). If you expand then you might get more hits but if your criteria is another MUST it will narrow and you'll only ever get the same or fewer hits than the first query.

@Mark_Harwood when I mentioned additional query i meant that there were 2 conditions inside the must array:

{
  "query": {
    "bool": {
      "must": [
      // condition 1 - must match postcodePrefix
      // condition 2 - must match category A
      ]
    }
  }
}

Is the above syntax correct? I read that as an AND therefore narrowing my search

Correct. That should narrow.
Can you post the docs/query/mappings that demonstrate an expansion? That part doesn't make sense seeing the counts go up with more MUSTs.

@Mark_Harwood i think I've been thinking about my search queries the wrong way and what you mentioned above regarding the overlap now makes sense.

My first query where i don't specify a category is wrong because thats just returning the number of documents where the postcodePrefix matches but thats not actually what i want. I need to know the number of transactions inside of that document i.e. the inner_hits count.

I'm going to explore inner_hits further but I'm now also questioning whether i should modify my mapping and not have the transactions as nested documents.

PS. i saw your video on " Wrestling Techniques Using Elasticsearch's Aggregations" and coincidentally I'm also working the UK Price Paid Data set.

Good to hear.

It made sense for my analysis at least to have transactions as their own documents and include the postcode as a term.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.