I have a single shard index with 1,464,467 docs. I am doing some aggregations on a keyword field. The terms aggregation is returning a "sum_other_doc_count": 1637376, which is more documents than I have in the index. I am trying to understand how it can be so inaccurate?
{
"aggs" : {
"cardinality_count" : { "cardinality" : { "field" : "dc:subjects" } },
"missing_count" : { "missing" : { "field" : "dc:subjects" } }
},
"size" : 0
}
Returns:
"took": 597,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1464467,
"max_score": 0,
"hits": []
},
"aggregations": {
"cardinality_count": {
"value": 105103
},
"missing_count": {
"doc_count": 501609
}
}
}
If I want to get the top 50 terms from my dataset my query is:
"size": 0,
"aggs": {
"my_terms": {
"terms": {
"field": "dc:subjects",
"size": 50
}
}
}
}
And the partial result is:
"took": 128,
"timed_out": false,
"hits": {
"total": 1464467,
},
"aggregations": {
"my_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 1637376,
"buckets": [
{
"key": "World War, 1939-1945",
"doc_count": 16259
},
{
"key": "Women",
"doc_count": 10369
},
{
"key": "Education",
"doc_count": 6699
},
{
"key": "Law",
"doc_count": 5248
},
{
"key": "Architecture",
"doc_count": 5018
},
{
"key": "Music",
"doc_count": 4813
},
I have read the documentation saying the counts can be inaccurate. I would like to understand how "sum_other_doc_count" can be 1637376 (for 1 shard). I realise the cardinality is high but I don't think its excessive for more than a million docs.
"sum_other_doc_count": this number is the sum of the document counts for all buckets that are not part of the response.
What am I not understanding?