I have a large daily index (150-200 GB) with several shards spread across different nodes. I want to find which terms have the largest average values. However, when I run a terms aggregation ordered by an avg sub-aggregation, I need to increase the "size" of the terms aggregation to get the "real" average of the term in question.
I worked out that the doc_count changes because some documents don't have "my_number_field", but that doesn't explain why I don't get the average over all documents where "my_term_field" is the same. Why do I need to change the size of the terms returned? I only need the top 100; in the background, ES should check every unique term, calculate its average, and return the TOP 100 after ordering.
What am I missing?
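For context, this is roughly the kind of request I mean (a sketch; index name is a placeholder, field names as above):

```json
POST /my-daily-index/_search
{
  "size": 0,
  "aggs": {
    "by_term": {
      "terms": {
        "field": "my_term_field",
        "size": 100,
        "order": { "avg_value": "desc" }
      },
      "aggs": {
        "avg_value": { "avg": { "field": "my_number_field" } }
      }
    }
  }
}
```

With "size": 100 the averages in the response change compared to running it with a much larger size.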
Thanks! It's a bit disappointing since there aren't too many docs but whatev'.
So this is what I missed, but I'm not sure what options I have to get the TOP 100 without requesting the TOP 10,000.
As far as I can tell, one option is to have the whole index on one shard, if possible, right?
Is there another way to have ES work through all documents in the index in question but only return the TOP 100 in the response?
EDIT: I did check the composite aggregation, but you can't order by a sub-aggregation, only by the "key" (term), which means I'd have to scroll through the results and order them programmatically, which I want to avoid at all costs. My cluster is beefy enough to do the heavy lifting, so I'd prefer a solution inside ES even if it's a memory- or CPU-heavy task.
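This is what the composite attempt looks like (a sketch, same placeholder fields): the sources can only be ordered by their key values, and while an avg sub-aggregation can be attached, it can't drive the ordering:

```json
POST /my-daily-index/_search
{
  "size": 0,
  "aggs": {
    "by_term": {
      "composite": {
        "size": 1000,
        "sources": [
          { "term_key": { "terms": { "field": "my_term_field", "order": "asc" } } }
        ]
      },
      "aggs": {
        "avg_value": { "avg": { "field": "my_number_field" } }
      }
    }
  }
}
```

So to get the TOP 100 by average, I'd have to page through every bucket with "after" and sort client-side.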
EDIT2:
I think I may have found a solution, but I'm not sure it's the best way.
Also, as in a lot of situations, my issue/solution may not help everyone.
1. Filter the result set in the query body as much as possible (e.g., using an `exists` query), so the returned doc count is as low as possible.
2. Write the parent terms aggregation, set a size you find acceptable, and sort by the sub-aggregation. This will be the basis for each shard's TOP list.
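Putting the two steps together, this is roughly what I ended up with (a sketch; index name is a placeholder):

```json
POST /my-daily-index/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "exists": { "field": "my_number_field" } }
      ]
    }
  },
  "aggs": {
    "by_term": {
      "terms": {
        "field": "my_term_field",
        "size": 10000,
        "order": { "avg_value": "desc" }
      },
      "aggs": {
        "avg_value": { "avg": { "field": "my_number_field" } }
      }
    }
  }
}
```

The `exists` filter keeps documents without the numeric field out of the aggregation entirely, so the doc_count per bucket reflects only documents that actually contribute to the average.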
This area is complex because distributed systems make things complex.
If you can work with a single shard/index, that makes analytics more accurate but brings its own problems in terms of scale.
Assuming you stick with multiple shards/indices, you'll likely benefit from the `shard_size` setting covered in the docs. It can increase accuracy, which you can check by looking at the `doc_count_error_upper_bound` value in the results; when it's zero, your results should be accurate. The `size` parameter has a similar effect but with the disadvantage that it increases the number of final results returned - many of which you may be uninterested in.
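In other words, `shard_size` can be raised independently of `size`, so each shard considers more candidate terms while the response still only returns the top 100 (a sketch, using the field names from above):

```json
{
  "size": 0,
  "aggs": {
    "by_term": {
      "terms": {
        "field": "my_term_field",
        "size": 100,
        "shard_size": 5000,
        "order": { "avg_value": "desc" }
      },
      "aggs": {
        "avg_value": { "avg": { "field": "my_number_field" } }
      }
    }
  }
}
```

The 5000 here is an arbitrary example value; you'd tune it until the error bounds in the response look acceptable.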
If you hit the "too many buckets" issue, you've pushed size/shard_size too hard and are trying to squeeze too much into your responses. You then need to back up and try to break your analysis into multiple requests. That could be done using queries as you suggest, or using partitioning in the terms aggregation.
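Partitioning splits the set of unique terms into groups, so each request only builds buckets for one partition and the client merges the per-partition results afterwards (a sketch; 20 partitions is an arbitrary choice):

```json
{
  "size": 0,
  "aggs": {
    "by_term": {
      "terms": {
        "field": "my_term_field",
        "include": { "partition": 0, "num_partitions": 20 },
        "size": 100,
        "order": { "avg_value": "desc" }
      },
      "aggs": {
        "avg_value": { "avg": { "field": "my_number_field" } }
      }
    }
  }
}
```

You'd repeat the request for `"partition": 0` through `19` and keep the overall top 100 across the runs.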
Here's a sketch of a decision process that might help.