Multilevel aggregation does not show correct results as compare to total hits

Hi

I am facing issue in aggregation withing aggregation as results count does not match with total hits, but when I search aggregation on single record without second level aggregation then shows correct counts.

I have created indices as

  • analytics-live-2021-06
  • analytics-live-2021-07
  • analytics-live-2021-08
  • analytics-live-2021-09

All indices are differentiated by _type as impression,view,call,lead etc

And documents in all indices are like:

{
        "_index": "analytics-live-2021-07",
        "_type": "impression",
        "_id": "AXqH8j2vAbeLubSJF1JN",
        "_score": 12.767446,
        "_source": {
          "event_source": "android",
          "uid": "9100483",
          "entity_type": "node",
          "url": "search_screen",
          "bundle": "classified",
          "entity_id": "45870510",
          "entity_uid": "xxx",
          "entity_url": "xyz/45870510",
          "drupal": 7,
          "ip": "37.208.130.216",
          "user_agent": "Dalvik/2.1.0 (Linux; U; Android 10; Redmi Note 9 Pro MIUI/V12.0.3.0.QJZMIXM)",
          "country": "QA",
          "@timestamp": "2021-07-09T00:07:17"
        }
      },
      {
        "_index": "analytics-test-2021-09-new",
        "_type": "view",
        "_id": "AXugmm1ZoEhud55w2U8X",
        "_score": 1,
        "_source": {
          "event_source": "android",
          "uid": "0",
          "entity_type": "node",
          "url": "ad_detail_screen",
          "bundle": "classified",
          "entity_id": 45872435,
          "entity_uid": "xxx",
          "entity_url": "xyz/45872435",
          "drupal": 7,
          "ip": "2402:8100:2842:2460::f903:ee5",
          "user_agent": "Dalvik/2.1.0 (Linux; U; Android 11; CPH2121 Build/RP1A.200720.011)",
          "country": "IN",
          "@timestamp": "2021-09-01T12:04:40"
        }
      },

There are total 1488349910 documents in all indices.

Now if run query for single entity_id field to get bucket aggregation of _type as:

GET analytics-live-2021*/_search
{
  "size":0,
  "query":{
    "bool":{
      "must":{
        "terms":{
          "entity_id":[45870510]
        }
      },
      "filter":{
        "range":{
          "@timestamp":{"gt":"2021-06-01T00:00:00"}
          
        }
      }
    }
  },
  "aggs":{
    "_type":{
      "terms":{"field":"_type"}
    }
    
  }
}

It gives following results which is accurate:

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 20,
    "successful": 20,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 8193,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "_type": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "impression",
          "doc_count": 8148
        },
        {
          "key": "view",
          "doc_count": 42
        },
        {
          "key": "reveal",
          "doc_count": 3
        }
      ]
    }
  }
}

Entity ID: 45870510
Total Hits: 8193
Aggregation: impression: 8148, view: 42, reveal: 3 === Total 8193

Issue:

But now I want to get same aggregation per entity_id but it shows incorrect results count and comes with the huge difference

Query

GET analytics-live-2021*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": {
        "terms": {
          "entity_id": [
            45870510,45327508
          ]
        }
      },
      "filter": {
        "range": {
          "@timestamp": {
            "gt": "2021-06-01T00:00:00"
          }
        }
      }
    }
  },
  "aggs": {
    "_type": {
      "terms": {
        "field": "entity_id.keyword"
      },
      "aggs": {
        "_type": {
          "terms": {
            "field": "_type"
          }
        }
      }
    }
  }
}

Results with wrong count as compare to total hits

{
  "took": 871,
  "timed_out": false,
  "_shards": {
    "total": 20,
    "successful": 20,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 11498,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "_type": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "45870510",
          "doc_count": 2292,
          "_type": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "impression",
                "doc_count": 2280
              },
              {
                "key": "view",
                "doc_count": 12
              }
            ]
          }
        },
        {
          "key": "45327508",
          "doc_count": 1009,
          "_type": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "impression",
                "doc_count": 975
              },
              {
                "key": "view",
                "doc_count": 32
              },
              {
                "key": "reveal",
                "doc_count": 2
              }
            ]
          }
        }
      ]
    }
  }
}

Entity ID: 45870510
Aggregation: impression: 2280, view: 12, reveal: xx === Total 2292

Please help me to get correct results against each entity_id in multilevel aggregation

I need aggregated of _type field for each entity_id with actual count

Khurram Fraz

Is this for Enterprise Search or Elasticsearch?

This is for elasticsearch

Ok let me move it for you (ps - there is a pretty big warning when you make a new post that category is not for Elasticsearch stuff :wink: ).

Thanks for your reply, I could not find proper category thats why I put here

That is curious.
What version of elasticsearch are you on?
Is entity_id mapped as a numeric field (long/integer etc)? If so does it behave differently if the search is on the entity_id.keyword field?

I am using Amazon Elasticsearch service version 5.6

entity_id is actually a numeric field and in many documents it is saved as text. so it is mixed text and number.

search with entity_id Or entity_id.keyword gives different results but not actual results. entity_id gives more results than entity_id.keyword.

That's quite an old version (released 4 years ago) so may well have issues that have been fixed already.

Showing the related JSON for mappings and documents for your example would help dig deeper.