How to aggregate by doc_count?

Sergio_Blanco · December 17, 2019, 7:25am

I want to aggregate the result of other aggregation using Elasticsearch. I have created the first aggregation I need:

    "size":0,
    "query": {
        "bool": {
          "filter": {
              "match" : {"type": "Posts"}
          },
          "filter": {
              "match" : {"PostTypeId": "1"}
          }
        }
    },
    "aggs" : {
        "by_user": {
          "terms": {
            "field": "OwnerUserId"
          }
        }
    }

This query takes all the documents of type post that are questions ( PostTypeId = 1 ). Then, it aggregates by OwnerUserId , which counts the number of question posts of each user, giving the following result:

{'took': 0,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 10000, 'relation': 'gte'},
  'max_score': None,
  'hits': []},
 'aggregations': {'by_user': {'doc_count_error_upper_bound': 0,
   'sum_other_doc_count': 31053,
   'buckets': [{'key': '2230', 'doc_count': 223},
    {'key': '', 'doc_count': 177},
    {'key': '38304', 'doc_count': 158},
    {'key': '5997', 'doc_count': 144},
    {'key': '4048', 'doc_count': 130},
    {'key': '25813', 'doc_count': 119},
    {'key': '27826', 'doc_count': 119},
    {'key': '2633', 'doc_count': 115},
    {'key': '19919', 'doc_count': 114},
    {'key': '13938', 'doc_count': 111}]}}}

Now I want to do another aggregation over the results of the previous one: aggregate by doc_count , I mean grouping and counting the equal number of question posts. For the previous result, my desired result would be:

{'buckets': [{'key': '223', 'doc_count': 1},
    {'key': '177', 'doc_count': 1},
    {'key': '158', 'doc_count': 1},
    {'key': '144', 'doc_count': 1},
    {'key': '130', 'doc_count': 1},
    {'key': '119', 'doc_count': 2},
    {'key': '115', 'doc_count': 1},
    {'key': '114', 'doc_count': 1},
    {'key': '111', 'doc_count': 1}]}

Mark_Harwood · December 17, 2019, 9:30am

Hi Sergio,
What you're asking to do is very hard in a distributed index because it relies on first joining up a lot of remote data for each OwnerUserId. If you have many unique IDs then physical constraints like network speeds and the amount of memory that can be used in a single request mean this is too much computation to attempt in a single query.
The alternative is to physically organise the data so that related items are closer to hand. Using the new transforms api you can convert your event-centric data (users posting things) to an entity-centric index (summaries of each user's behaviour). Using this index makes analysis of user behaviours faster and simpler (and in some cases simply "possible")

Sergio_Blanco · December 17, 2019, 9:59am

But I do not mind about efficiency as I am doing some research. I only want to know if there exists any syntax that allows me to aggregate by doc_count even if it is not efficient. @Mark_Harwood

Mark_Harwood · December 17, 2019, 10:03am

The transforms API is about efficiency but it makes querying simpler too. It uses the aggregations framework to make the summary docs so does give you that "aggregations on an aggregation" solution.
Stage 1 is to aggregate the total posts for each user with the transform API.
Stage 2 is to do a histogram aggregation on the "total_posts" field in the users index you created in step 1.

Sergio_Blanco · December 17, 2019, 12:00pm

Can you show me how this query would be? I cannot find many examples of the transforms API and I cannot figure the query out. @Mark_Harwood

Hendrik_Muhs · December 17, 2019, 3:18pm

Hi Sergio

Examples can be found here.

For your use case, something like this should work:

PUT _transform/transform_1
{
  "source": {
    "index": "YOUR_SOURCE_INDEX",
    "query": {YOUR_FILTER_QUERY}
  },
  "dest": {
    "index": "YOUR_DEST_INDEX"
  },
  "pivot": {
    "group_by": {
      "OwnerUserId": {
        "terms": {
          "field": "OwnerUserId"
        }
      }
    },
    "aggregations": {
      "count": {
        "value_count": {
          "field": "OwnerUserId"
        }
      }
    }
  }
}

You probably want to make this a continuous transform in order to update the destination index automatically for new data. Checkout the sync parameter. The mentioned examples also contain some continuous transforms, another good resource can be found here.

As @Mark_Harwood pointed out already, the created YOUR_DEST_INDEX can be queried for further analysis and should solve your usecase.

Sergio_Blanco · December 17, 2019, 4:10pm

Thank you, that solved my usecase.

However, is it true that there is no syntax available to directly access the doc_count field returned by the aggregation I showed in the query of the first post comment, using a second nested terms aggregation? @Mark_Harwood @Hendrik_Muhs

Mark_Harwood · December 17, 2019, 4:13pm

Generally speaking - if something isn't scalable, we don't offer it and instead build something like the transforms API to make it scalable

Sergio_Blanco · December 17, 2019, 4:15pm

I understand.

Thank you so much for your time and extremely detailed answers !!

system · January 14, 2020, 4:15pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
How to get the count of the documents inside term aggregation that have doc_count as 1? Elasticsearch	1	495	January 11, 2018
Aggregating by count of another aggregation? Elasticsearch	1	370	July 5, 2017
Elasticsearch query to aggregate results of buckets Elasticsearch	1	509	July 5, 2017
Group by Aggregation doc_count Elasticsearch	1	656	November 14, 2018
Get sum of doc_count values in aggregation Elasticsearch	2	2139	August 25, 2018

How to aggregate by doc_count?

Related topics