Aggregate Query on a Large Number of Documents by filtering out fewer doc counts

rvshchwl · October 22, 2019, 7:41pm

I am trying to write an Aggregate Query for Elasticsearch, that can be performed on a large volume of data, but also filters out any buckets that have a small doc count.

The data I am using is structured as follows (relevant fields shown)

{
    "Timestamp" : 'date' format
    "Device_ID" : 'integer' format,
    "Doc_ID" : 'integer' format
}

The Device ID field is not a unique identifier, and the Date Range is is several months.

The data comes in every second, with a high throughput of over a 1,000 documents per second, sometimes even more.

The aggregation I want to do is the following:

Aggregate first by a Date Histogram, of every 1 minute
Sub Aggregate on Device ID, mapping Device-ID to 'Num-Documents' for that device ID within that 1 minute.

I have tried the following query already:

  "aggs" : {
    "resample" : {
      "date_histogram": {
        "field": "Timestamp",
        "interval": "minute"
      },
  "aggs" : {
    "devices" : {
      "terms" : { 
        "field" : "Device_ID"
    }
  }
}

But it throws a throughput exception,

Trying to create too many buckets. Must be less than or equal to: [10000] but was [10001]."

There are two things that I think could be done, but I am not sure how to write a query to support it.

Filter out all Documents within a timestamp where the Document counts from the Sub-aggregation on Device ID is less than 100. From my own knowledge of the data, I know that this will filter out 70% of the data, which is not needed for our further data processing steps.
Return the documents associated with each Bucket, so they can be mapped back. This would also solve the problem, but the query doesn't return these.

Can these steps be done to get the result I want?

Glen_Smith · October 23, 2019, 3:18am

Ravish,

There are a couple of strategies you can try here.

Keeping with your suggesting, you can specify the minimum occurrences of a field value in the terms agg before that value is allocated a bucket using the min_doc_count parameter. However, as is mentioned in that doc, the collection per-shard can't "know" how many occurrences of each value there are in the entire index, so the coordinating node will need to collect a lot of the data that will be discarded for the end aggregation response. You can experiment with this to see if it is a solution; I'm not sure whether the bucket count might hit the circuit breaker before the minimum doc count rule is applied.

Another approach, designed specifically to address this issue, is to partition your request. Then you are only creating a fraction of the buckets for each request you make.

system · November 20, 2019, 3:24am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Group docs with same internal ID and then aggregate in to buckets by number of matches Elasticsearch	1	338	July 30, 2019
Too many aggregation buckets Elasticsearch	3	3073	July 5, 2017
Histogram aggregation Elasticsearch Elasticsearch	1	337	November 15, 2019
Aggregation on most recent document in a group Elasticsearch	5	900	May 22, 2018
Elasticsearch query to filter count by aggregate Elasticsearch	1	2570	October 1, 2018

Aggregate Query on a Large Number of Documents by filtering out fewer doc counts

Related topics