Aggregate Query on a Large Number of Documents, Filtering Out Buckets with Low Doc Counts

I am trying to write an aggregation query for Elasticsearch that can run over a large volume of data but also filters out any buckets with a small doc count.

The data I am using is structured as follows (relevant fields shown):

{
    "Timestamp" : 'date' format,
    "Device_ID" : 'integer' format,
    "Doc_ID" : 'integer' format
}

The Device_ID field is not a unique identifier, and the date range spans several months.

The data comes in every second, with a high throughput of over 1,000 documents per second, sometimes more.

The aggregation I want to do is the following:

  • Aggregate first by a date histogram with a 1-minute interval
  • Sub-aggregate on Device_ID, mapping each Device_ID to the number of documents for that device within that minute.

I have tried the following query already:

  "aggs" : {
    "resample" : {
      "date_histogram": {
        "field": "Timestamp",
        "interval": "minute"
      },
  "aggs" : {
    "devices" : {
      "terms" : { 
        "field" : "Device_ID"
    }
  }
}

But it fails with a too-many-buckets exception:

Trying to create too many buckets. Must be less than or equal to: [10000] but was [10001].

There are two things I think could be done, but I am not sure how to write a query to support them.

  • Filter out, within each time bucket, any Device_ID whose document count from the sub-aggregation is less than 100. From my own knowledge of the data, I know this will filter out about 70% of it, which is not needed for our further data-processing steps.

  • Return the documents associated with each bucket so they can be mapped back. This would also solve the problem, but the query above doesn't return them.

Can these steps be done to get the result I want?

Ravish,

There are a couple of strategies you can try here.

Keeping with your suggestion, you can specify the minimum number of occurrences a field value must have in the terms agg before that value is allocated a bucket, using the min_doc_count parameter. However, as that doc mentions, per-shard collection can't "know" how many occurrences of each value there are in the entire index, so the coordinating node still needs to collect a lot of data that will be discarded from the final aggregation response. You can experiment with this to see if it is a solution; I'm not sure whether the bucket count might hit the circuit breaker before the minimum doc count rule is applied.
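As a rough sketch, keeping your field names, it would look something like the following. The min_doc_count value of 100 comes from your own threshold; the size value is an assumption you would tune to the number of devices you expect per minute:

  "aggs": {
    "resample": {
      "date_histogram": {
        "field": "Timestamp",
        "interval": "minute"
      },
      "aggs": {
        "devices": {
          "terms": {
            "field": "Device_ID",
            "min_doc_count": 100,
            "size": 100
          }
        }
      }
    }
  }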

Another approach, designed specifically to address this issue, is to partition your request. Then you are only creating a fraction of the buckets for each request you make.
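A sketch of one such partitioned request is below. The num_partitions and size values are placeholders you would choose based on the cardinality of Device_ID, and you would run the request once per partition, incrementing "partition" from 0 to num_partitions - 1:

  "aggs": {
    "resample": {
      "date_histogram": {
        "field": "Timestamp",
        "interval": "minute"
      },
      "aggs": {
        "devices": {
          "terms": {
            "field": "Device_ID",
            "include": {
              "partition": 0,
              "num_partitions": 20
            },
            "size": 500
          }
        }
      }
    }
  }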
