I have one query which does aggregation on day interval for specified date range (like last month). Data is sorted based on 'MyUsage' field to return top 5 'FldNameWise'.
Problem is it retuns null aggregations for some of the day interval although data is present. If I select only specific 'FldNameWise' data then that perticular day interval have values.
Index template and other things are correct only. Not sure why null values coming in aggregation.
If you have multiple nodes/shards then some of them may not have contributed results for certain "FldName" terms to the final analysis because they weren't seen as competitive locally (a low usage average).
This wizard walks you through the conditions under which this may happen, how any inaccuracies might be shown and what alternative approaches to try:
In this case it may prove to be more efficient to break your search into 2 requests - one to identify the names with the most usage and then another follow-up query to fill in their low-level details.
Trying to do it all in one request may be inefficient because the shard_size may need increasing to retrieve more terms for fully accurate usage stats and if shard_size is much greater than size a large number of results (and all their detailed date histograms) will be thrown away.
Though I will explore on breaking search requests into 2 request as suggested by you.
We have only single node and single shard.
Moreover, If I select only particular value in 'FldName' or remove "order": { "FldOrder": "desc" }, then I get data for all day intervals.
Out of interest, how many unique "FldName" values do you have? (The cardinality agg can help reveal this).
Also, how many total docs in the index and in the time range you query on?
I have around only 100 unique "FldName" values. I query on max 30 days datetime range.
that comprises around 30 indices as I have daily index. One index contains around 38400 documents, so for all 30 indices it might go upto 1152000 approax.
Ah - that explains my confusion - multiple indices means multiple shards which means the potential for this distributed counting problem (even though your content is not currently physically distributed across multiple nodes).
"When the aggregation is either sorted by a sub aggregation or in order of ascending document count, the error in the document counts cannot be determined and is given a value of -1 to indicate this."
All of my original advice still stands.
That's not a huge number so you could try set shard_size to 100 to gather all the stats - your final result is only 5 so we'd throw away 95 date histograms which is less than ideal. It would be more efficient to do 2 requests but you might get away with 1 in terms of memory depending on the length of the date range and the size of date interval.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.