Aggregation queries taking long


(Hitesh Chavhan) #1

Hi,
I have the aggregation query below, which aggregates all employees based on their score.
This is a weekly score given to each employee. There are 3032561 such records in ES, each containing a list of dicts, one per employee.

I am querying ES from Node.js, and the query times out before returning any data.
Someone told me that ES is not able to run aggregations on this amount of data, but I don't think that's the case. Please help me out.

Below is the query.
{
  "query": {
    "bool": {
      "must": []
    }
  },
  "aggs": {
    "emp": {
      "nested": {
        "path": "emp"
      },
      "aggs": {
        "scores": {
          "terms": {
            "field": "emp.emp_name.case_sensitive",
            "size": 0,
            "order": {
              "total_score": "desc"
            }
          },
          "aggs": {
            "total_score": {
              "sum": {
                "field": "emp.score"
              }
            }
          }
        }
      }
    }
  }
}


(Mark Harwood) #2

What are the root level docs? I can see you are using nested in the agg so presumably each root doc can have more than one employee. How many employees are there per root doc?

Have you estimated how big the JSON response would be?


(Hitesh Chavhan) #3

The root level is a performance doc, which does contain emp as a list of dicts with more than one employee.
I guess there are more than 8 employees per root doc.
Below is the format.
{
  "date": "12/12/2009",
  "record_id": 1003,
  "emp": [
    { "emp_name": "Robert Madis", "score": 10 },
    { "emp_name": "Piras jicking", "score": 12 }
  ]
}

I have not estimated the size of the JSON response, but considering the data, it should not be that big.
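
For reference, here is a minimal local sketch of what the nested terms + sum aggregation in the query computes on documents of the shape above: one bucket per distinct employee name, with the scores summed and sorted in descending order. The second sample document (record_id 1004 and its date) is made up for illustration, and `totalScores` is a hypothetical helper, not part of any Elasticsearch client.

```javascript
// Sample documents following the format posted above.
// The second document is invented purely for illustration.
const docs = [
  {
    date: "12/12/2009",
    record_id: 1003,
    emp: [
      { emp_name: "Robert Madis", score: 10 },
      { emp_name: "Piras jicking", score: 12 },
    ],
  },
  {
    date: "12/19/2009",
    record_id: 1004,
    emp: [{ emp_name: "Robert Madis", score: 7 }],
  },
];

// Sum scores per employee name across all root docs, then sort by
// total score descending (what "order": {"total_score": "desc"} asks for).
function totalScores(rootDocs) {
  const totals = new Map();
  for (const doc of rootDocs) {
    for (const e of doc.emp) {
      totals.set(e.emp_name, (totals.get(e.emp_name) || 0) + e.score);
    }
  }
  return [...totals.entries()]
    .map(([emp_name, total_score]) => ({ emp_name, total_score }))
    .sort((a, b) => b.total_score - a.total_score);
}

console.log(totalScores(docs));
```

The result has one entry per distinct employee name, so its size depends on how many different employees appear in the index, not on the number of root docs alone.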


(Mark Harwood) #4

Let's work it out:

Theoretically, the 3m+ performance docs could all refer to the same 8 employees, so the final result could be only 8 employees, but I guess that's highly unlikely; otherwise no one would get any work done due to constant performance reviews.
Let's assume all employees are unique and each requires 50 bytes to return the name and score.

This gives us 3032561 x 8 x 50 = 1,213,024,400 bytes, or roughly 1.1GB of JSON.

That's a lot of JSON data to create and serialize, so I'm not surprised it takes a while.
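
The same back-of-the-envelope estimate can be written as a small sketch. The function name `estimateResponseBytes` is hypothetical; the inputs are the worst-case assumptions stated above (every employee unique, 8 employees per root doc, 50 bytes per returned bucket).

```javascript
// Worst-case response size: one bucket per employee entry,
// assuming every employee name is unique.
function estimateResponseBytes(rootDocs, employeesPerDoc, bytesPerBucket) {
  return rootDocs * employeesPerDoc * bytesPerBucket;
}

const bytes = estimateResponseBytes(3032561, 8, 50);
console.log(bytes);                                 // 1213024400
console.log((bytes / 2 ** 30).toFixed(2) + " GiB"); // 1.13 GiB
```

If the number of distinct employees is much smaller than the number of employee entries, the real response will be correspondingly smaller, so the distinct-name count is the figure worth measuring first.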


(Hitesh Chavhan) #5

So, as I understand it, with larger data sizes and aggregations, creating and serializing the response takes time.

The aggregation itself runs fast, but creating and serializing the JSON consumes most of the total time.
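
If the response size is the bottleneck, one option is to ask for only the top N buckets instead of all of them. In the Elasticsearch versions that accept it, `"size": 0` on a terms aggregation requests every bucket; a positive value caps the number returned. The sketch below only builds the request body as a plain object. `buildTopScoresQuery` and the choice of N are hypothetical, and the field names are taken from the query posted at the top of the thread.

```javascript
// Build the same aggregation as in the original query, but with a
// bounded number of buckets. topN is an illustrative parameter.
function buildTopScoresQuery(topN) {
  return {
    query: { bool: { must: [] } },
    aggs: {
      emp: {
        nested: { path: "emp" },
        aggs: {
          scores: {
            terms: {
              field: "emp.emp_name.case_sensitive",
              size: topN, // cap the bucket count instead of size: 0
              order: { total_score: "desc" },
            },
            aggs: {
              total_score: { sum: { field: "emp.score" } },
            },
          },
        },
      },
    },
  };
}

// Request body for the 100 highest-scoring employees.
const body = buildTopScoresQuery(100);
console.log(JSON.stringify(body, null, 2));
```

Note that with a bounded size and ordering by a sub-aggregation, the top buckets can be approximate when the index is spread over several shards, so this trades exactness for a much smaller response.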

