Sum_other_doc_count higher than total docs

I have a single shard index with 1,464,467 docs. I am doing some aggregations on a keyword field. The terms aggregation is returning a "sum_other_doc_count": 1637376, which is more documents than I have in the index. I am trying to understand how it can be so inaccurate?

{
    "aggs" : {
        "cardinality_count" : { "cardinality" : { "field" : "dc:subjects" } },
        "missing_count" : { "missing" : { "field" : "dc:subjects" } }
    },
    "size" : 0
}

Returns:

        "took": 597,
        "timed_out": false,
        "_shards": {
            "total": 1,
            "successful": 1,
            "skipped": 0,
            "failed": 0
        },
        "hits": {
            "total": 1464467,
            "max_score": 0,
            "hits": []
        },
        "aggregations": {
            "cardinality_count": {
                "value": 105103
            },
            "missing_count": {
                "doc_count": 501609
            }
        }
    }

If I want to get the top 50 terms from my dataset my query is:

      "size": 0,
      "aggs": {
        "my_terms": {
          "terms": {
            "field": "dc:subjects",
            "size": 50
          }
        }
      }
    }

And the partial result is:

    "took": 128,
    "timed_out": false,
    "hits": {
        "total": 1464467,
    },
    "aggregations": {
        "my_terms": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 1637376,
            "buckets": [
                {
                    "key": "World War, 1939-1945",
                    "doc_count": 16259
                },
                {
                    "key": "Women",
                    "doc_count": 10369
                },
                {
                    "key": "Education",
                    "doc_count": 6699
                },
                {
                    "key": "Law",
                    "doc_count": 5248
                },
                {
                    "key": "Architecture",
                    "doc_count": 5018
                },
                {
                    "key": "Music",
                    "doc_count": 4813
                },

I have read the documentation saying the counts can be inaccurate. I would like to understand how "sum_other_doc_count" can be 1637376 (for 1 shard). I realise the cardinality is high but I don't think its excessive for more than a million docs.

"sum_other_doc_count": this number is the sum of the document counts for all buckets that are not part of the response.

What am I not understanding?

Do your documents have more than one value for dc:subjects? If so, a single document can be in multiple buckets for the terms aggregation. In other words, a document will be counted multiple times, if it has multiple values for dc:subjects (once for each value it has for dc:subjects).

That's how you end up with a sum_other_doc_count that is higher than your number of documents - the sum of the counts of all the remaining buckets (outside of the top 50 buckets) adds up to more than the number of documents.

I just checked my data. You are right, its a multi-valued field. Make perfect sense, thanks for your help.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.