I am trying to write an aggregation query for Elasticsearch that can run over a large volume of data, but also filters out any buckets that have a small doc count.
The data I am using is structured as follows (relevant fields shown):

{
  "Timestamp": 'date' format,
  "Device_ID": 'integer' format,
  "Doc_ID": 'integer' format
}
The Device_ID field is not a unique identifier, and the date range spans several months.
The data comes in every second, with a high throughput of over 1,000 documents per second, sometimes more.
The aggregation I want to do is the following:
- Aggregate first by a date histogram with a 1-minute interval
- Sub-aggregate on Device_ID, mapping each Device_ID to the number of documents for that device within that 1-minute bucket.
I have tried the following query already:
"aggs": {
  "resample": {
    "date_histogram": {
      "field": "Timestamp",
      "interval": "minute"
    },
    "aggs": {
      "devices": {
        "terms": {
          "field": "Device_ID"
        }
      }
    }
  }
}
But it throws a too-many-buckets exception:

Trying to create too many buckets. Must be less than or equal to: [10000] but was [10001]
There are two things I think could be done, but I am not sure how to write a query that supports them.
- Filter out, within each timestamp bucket, every Device_ID sub-bucket whose document count is less than 100. From my own knowledge of the data, I know this would filter out 70% of the data, which is not needed for our further processing steps.
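From reading about pipeline aggregations, I suspect a bucket_selector sub-aggregation might express the first idea, filtering out terms buckets below the count threshold, but I have not tested this and am not sure it is correct (the "min_count" name is just my own placeholder):

```json
"aggs": {
  "devices": {
    "terms": {
      "field": "Device_ID"
    },
    "aggs": {
      "min_count": {
        "bucket_selector": {
          "buckets_path": {
            "count": "_count"
          },
          "script": "params.count >= 100"
        }
      }
    }
  }
}
```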
- Return the documents associated with each bucket, so they can be mapped back. This would also solve the problem, but the query above doesn't return them.
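For the second idea, I believe a top_hits sub-aggregation can return a sample of the documents inside each Device_ID bucket, though I am unsure how well it behaves at this data volume (a sketch; the "docs" name and the size of 10 are arbitrary choices of mine):

```json
"aggs": {
  "devices": {
    "terms": {
      "field": "Device_ID"
    },
    "aggs": {
      "docs": {
        "top_hits": {
          "size": 10,
          "_source": ["Doc_ID"]
        }
      }
    }
  }
}
```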
Can these steps be done to get the result I want?