Handling nested bucketing when the data set is very large

Hi there, I am new to Elasticsearch and my use case is pretty simple: I am running a query, with aggregations, which returns more than 10,000 documents. I am navigating through the buckets and getting the relevant data.
Below is the query:

POST staging-order/_search
{
  "size": 10000,
  "aggs": {
    "order_ranges": {
      "range": {
        "field": "order.delivery_date",
        "ranges": [
          {
            "from": "2019-01-30T06:33:07+00:00",
            "to": "2020-01-30T06:33:07+00:00"
          },
          {
            "from": "2020-01-30T06:33:07+00:00",
            "to": "2020-03-30T06:33:07+00:00"
          },
          {
            "from": "2020-03-30T06:33:07+00:00",
            "to": "2020-07-30T06:33:07+00:00"
          }
        ]
      },
      "aggs": {
        "customer_email": {
          "terms": {
            "field": "order.customer_id",
            "size": 10000
          }
        }
      }
    }
  }
}

The from and to parameters can be optional.
While running this I am getting a too_many_buckets_exception.

Changing index.max_result_window looks more like a temporary hack, with a good chance of putting extra strain on resources. My environment is an AWS-managed cluster, so I don't know how to change the settings. I tried to change them using the query below:

POST _cluster/settings
{
  "transient": {
    "search.max_buckets": 20000
  }
}

But I got an error: "Your request: '/_cluster/settings' is not allowed for verb: POST"

Thanks for the help!

Hi,

The cluster settings API requires using PUT instead of POST. Also be aware that transient changes are lost after a cluster restart.
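For example, the same request with the correct verb (you could use "persistent" instead of "transient" if the change should survive a restart):

PUT _cluster/settings
{
  "transient": {
    "search.max_buckets": 20000
  }
}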

Another option might be to use paging: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-from-size

You could, for example, read the results in bundles of 1,000 and then request the next result set, but this depends on the use case and on whether you are able to make multiple requests.
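A minimal sketch of such a paged request (the from/size values and the match_all query are only illustrative):

POST staging-order/_search
{
  "from": 1000,
  "size": 1000,
  "query": {
    "match_all": {}
  }
}

Each subsequent request would increase "from" by the page size until the result set is exhausted.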

Best regards
Wolfram

Hi Wolfram_Haussig,

I have tried the cluster settings API using PUT and now I am getting the error
"Your request: '/_cluster/settings' payload is not allowed."

I also tried using from/size in the query, but it still gives an error:
Result window is too large, from + size must be less than or equal to: [10000] but was [10009]
I believe that with from/size we can only get batches which lie within the 10,000-document result window.

Note that my intention here is to use the bucket results, i.e. the aggregations, and not the hits.
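For paging through buckets rather than hits, one common approach is a composite aggregation, which can be paginated with an after key. A sketch under the assumptions of the original query (the source name "customer" is illustrative):

POST staging-order/_search
{
  "size": 0,
  "aggs": {
    "customer_buckets": {
      "composite": {
        "size": 1000,
        "sources": [
          { "customer": { "terms": { "field": "order.customer_id" } } }
        ]
      }
    }
  }
}

Each response contains an after_key; passing it back in an "after" parameter of the next request fetches the next page of buckets, so no single request needs to create more than 1,000 buckets.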

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.