Use nested doc_count in terms aggregation

tomr · March 24, 2017, 8:23am

I have documents of the form

{
  "item": "name",
  "@timestamp": <date>,
  "category": [
    { "key": "something", "doc_count": num },
    { "key": "otherthing", "doc_count": num },
    { "key": "thirdthing", "doc_count": num }
  ]
}

Mapping has [category][key] as keyword and [category][doc_count] as long.

There are a few thousand documents for each value of [item], with a lot of repetition of the values for key but variation in the value for doc_count. The data is derived from 5-minutely snapshots of high traffic web logs. I'm trying to run something like a terms agg, but I want to add up all the values of doc_count rather than just counting the instances of key itself. Ultimately I want to show the top N values based on those totals. (I realise the counts will be approximate a la sharding approximations - that's fine)

It's unclear to me whether I should be trying a nested agg combo of some sort, or perhaps a terms agg with a sum subagg. A little clue as to how I should get started would be very much appreciated.

tia,
Tom

tomr · March 27, 2017, 7:11am

I thnk I've figured this out, thanks to reading this a few times: http://stackoverflow.com/questions/30583679/elasticsearch-terms-and-sum-aggregation

In short, it's a nested agg with path set to category, with a terms agg on category.key under that, and a sum agg on category.doc_count under the terms agg.

I'm not yet sure whether I'm actually getting an accurate figure - in other words, whether all sums are calculated and then I get the top 10 results, or whether only the top 10 most popular terms have their sums calculated. Given the enormous response time I'm seeing, I guess it's the former. Next I'll look into whether I can do a top_hits agg instead of either the terms or sum agg to improve performance.

For the time being, my query looks like

{
  "aggs": {
    "nested_categories": {
      "nested": {
        "path": "category"
      },
      "aggs": {
        "category_terms": { 
          "terms": {
            "field": "category.key"
          },
          "aggs": {
            "category_sums": {
              "sum": { 
                "field": "category.doc_count"  
              }
            }
          }
        }
      }
    }
  }

The output produced looks like:

{
  "took": 22000,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 152872,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "nested_categories": {
      "doc_count": 679153,
      "category_terms": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 174340,
        "buckets": [
          {
            "key": "key1",
            "doc_count": 124980,
            "client_url_sums": {
              "value": 10920199
            }
          },
          {
            "key": "key2",
            "doc_count": 124149,
            "category_sums": {
              "value": 10714383
            }
          },
          {
            "key": "key3",
            "doc_count": 93561,
            "category_sums": {
              "value": 16788697
            }
          }
        ]
      }
    }
  }
}

system · April 24, 2017, 7:11am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Aggregate doc_count per document and not per nested item? Elasticsearch	2	393	February 6, 2020
Wrong doc_count on Aggregations Elasticsearch	7	3839	July 5, 2017
Aggregation document count issue Elasticsearch	4	1358	August 19, 2018
Nested Aggregation, count of parents Elasticsearch	1	468	August 8, 2017
How to improve accuracy of aggregation doc_count Elasticsearch	1	601	October 15, 2018

Use nested doc_count in terms aggregation

Related topics