How to aggregate by doc_count?

I want to aggregate the result of other aggregation using Elasticsearch. I have created the first aggregation I need:

    "size":0,
    "query": {
        "bool": {
          "filter": {
              "match" : {"type": "Posts"}
          },
          "filter": {
              "match" : {"PostTypeId": "1"}
          }
        }
    },
    "aggs" : {
        "by_user": {
          "terms": {
            "field": "OwnerUserId"
          }
        }
    }

This query takes all the documents of type post that are questions ( PostTypeId = 1 ). Then, it aggregates by OwnerUserId , which counts the number of question posts of each user, giving the following result:

{'took': 0,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 10000, 'relation': 'gte'},
  'max_score': None,
  'hits': []},
 'aggregations': {'by_user': {'doc_count_error_upper_bound': 0,
   'sum_other_doc_count': 31053,
   'buckets': [{'key': '2230', 'doc_count': 223},
    {'key': '', 'doc_count': 177},
    {'key': '38304', 'doc_count': 158},
    {'key': '5997', 'doc_count': 144},
    {'key': '4048', 'doc_count': 130},
    {'key': '25813', 'doc_count': 119},
    {'key': '27826', 'doc_count': 119},
    {'key': '2633', 'doc_count': 115},
    {'key': '19919', 'doc_count': 114},
    {'key': '13938', 'doc_count': 111}]}}}

Now I want to do another aggregation over the results of the previous one: aggregate by doc_count , I mean grouping and counting the equal number of question posts. For the previous result, my desired result would be:

{'buckets': [{'key': '223', 'doc_count': 1},
    {'key': '177', 'doc_count': 1},
    {'key': '158', 'doc_count': 1},
    {'key': '144', 'doc_count': 1},
    {'key': '130', 'doc_count': 1},
    {'key': '119', 'doc_count': 2},
    {'key': '115', 'doc_count': 1},
    {'key': '114', 'doc_count': 1},
    {'key': '111', 'doc_count': 1}]}

Hi Sergio,
What you're asking to do is very hard in a distributed index because it relies on first joining up a lot of remote data for each OwnerUserId. If you have many unique IDs then physical constraints like network speeds and the amount of memory that can be used in a single request mean this is too much computation to attempt in a single query.
The alternative is to physically organise the data so that related items are closer to hand. Using the new transforms api you can convert your event-centric data (users posting things) to an entity-centric index (summaries of each user's behaviour). Using this index makes analysis of user behaviours faster and simpler (and in some cases simply "possible")

1 Like

But I do not mind about efficiency as I am doing some research. I only want to know if there exists any syntax that allows me to aggregate by doc_count even if it is not efficient. @Mark_Harwood

The transforms API is about efficiency but it makes querying simpler too. It uses the aggregations framework to make the summary docs so does give you that "aggregations on an aggregation" solution.
Stage 1 is to aggregate the total posts for each user with the transform API.
Stage 2 is to do a histogram aggregation on the "total_posts" field in the users index you created in step 1.

1 Like

Can you show me how this query would be? I cannot find many examples of the transforms API and I cannot figure the query out. @Mark_Harwood

Hi Sergio

Examples can be found here.

For your use case, something like this should work:

PUT _transform/transform_1
{
  "source": {
    "index": "YOUR_SOURCE_INDEX",
    "query": {YOUR_FILTER_QUERY}
  },
  "dest": {
    "index": "YOUR_DEST_INDEX"
  },
  "pivot": {
    "group_by": {
      "OwnerUserId": {
        "terms": {
          "field": "OwnerUserId"
        }
      }
    },
    "aggregations": {
      "count": {
        "value_count": {
          "field": "OwnerUserId"
        }
      }
    }
  }
}

You probably want to make this a continuous transform in order to update the destination index automatically for new data. Checkout the sync parameter. The mentioned examples also contain some continuous transforms, another good resource can be found here.

As @Mark_Harwood pointed out already, the created YOUR_DEST_INDEX can be queried for further analysis and should solve your usecase.

1 Like

Thank you, that solved my usecase.

However, is it true that there is no syntax available to directly access the doc_count field returned by the aggregation I showed in the query of the first post comment, using a second nested terms aggregation? @Mark_Harwood @Hendrik_Muhs

Generally speaking - if something isn't scalable, we don't offer it and instead build something like the transforms API to make it scalable

I understand.

Thank you so much for your time and extremely detailed answers !!

1 Like

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.