How to use arithmetic operation inside bucket aggregation on a very large data containing millions document in elasticsearch?


(Vaibhav Goel) #1

Using script within bucket aggregation on a very large set of data takes very long time to return result and many a times Elasticsearch health goes down due to this.

ES has around 10 millions documents and it haven't given result while performing arithmetic operation on 2 fields and make ES down, on using 1 field and 1 integer it return result.. My query :

GET urza/user/_search
{
"size": 0,
"aggs" : {
"prices" : {
"histogram" : {
"script" :
{
"inline":"doc['total_revenue'].value * doc['total_event'].value ",
"lang": "painless"
},
"interval" : 100
}
}
}
}


(Colin Goodheart-Smithe) #2

What are the range of values for total_revenue and total_event (i.e. what are the minimum and maximum values of those two fields)?


(Vaibhav Goel) #3

total revenue can be from 0 to 1000000 and total event can be from 0 to 1000


(Colin Goodheart-Smithe) #4

Ok so the range of values which can be output from your script is 1,000,000 * 1,000 = 1,000,000,000. The interval for your histogram is 100 so you are requesting 10,000,000 buckets from the histogram. This will take alot amount of memory so the reason you are not getting a response and the cluster health is affected is likely to be because this request is causing an OutOfMemoryError. You can confirm this by looking at the Elasticsearch server log when you see the happen.

I would suggest you choose a larger interval which will create a smaller number of buckets or, if you can, increase the amount of heap memory you give your Elasticsearch server.


(Vaibhav Goel) #5

is this the best way to do arithmetic operation or their is any other better way to perform arithmetic operation and is this query can perform good for even more documents???


(Colin Goodheart-Smithe) #6

it is not the number of documents that is causing the issue here. The issue is down to you asking for 10,000,000 buckets of information. You would have the same issue if instead of multiplying the two fields in a script you had a single field which already contained the result of total_revenue * total_event for each document.

You need to either give Elasticsearch more heap memory so it can store the state for the number of buckets you are requesting or change the request to ask for less buckets in the histogram (for example by increasing the interval) until the server can cope with the request.

If we take a step back and look at this from a higher level, what are you trying to present to the user by making this request? what question are you trying to answer from the data?


(Vaibhav Goel) #7

Thanks a lot.. This solve my issue...i will either give it more heap space or increase the interval size.


(system) #8

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.