Aggregation returns null buckets

I have a query that aggregates on a one-day interval over a specified date range (e.g. the last month). The terms are ordered by the 'MyUsage' field average to return the top 5 'FldNameWise' buckets.

The problem is that it returns null aggregations for some of the day intervals although data is present. If I query only a specific 'FldName' value, then that particular day interval does have values.

The index template and other settings are correct, so I am not sure why null values are coming back in the aggregation.

Kindly help.
Sample query:

{
  "aggs": {
    "FldNameWise": {
      "terms": {
        "field": "FldName",
        "order": {
          "FldOrder": "desc"
        },
        "size": 5
      },
      "aggs": {
        "FldOrder": {
          "avg": {
            "field": "MyUsage"
          }
        },
        "DateTimeWise": {
          "date_histogram": {
            "field": "MyDateTime",
            "interval": "1d",
            "time_zone": "abc"
          },
          "aggs": {
            "MyUitl": {
              "avg": {
                "field": "MyUsage"
              }
            },
            "MyUitl2": {
              "avg": {
                "field": "MyUsage2"
              }
            }
          }
        }
      }
    }
  },
  "size": 0,
  "docvalue_fields": [
    {
      "field": "MyDateTime",
      "format": "date_time"
    }
  ],
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "MyDateTime": {
              "gte": 123,
              "lte": 456,
              "format": "epoch_millis"
            }
          }
        }
      ]
    }
  }
}

I am getting null values for the 'MyUitl' aggregation in some day buckets even though data is present.

Hi, is there anyone who can help, please?

If you have multiple nodes/shards, then some of them may not have contributed results for certain "FldName" terms to the final reduction because those terms weren't seen as competitive locally (they had a low usage average).
This wizard walks you through the conditions under which this can happen, how any inaccuracies are reported, and what alternative approaches to try.

In this case it may prove more efficient to break your search into 2 requests - one to identify the names with the most usage, and then a follow-up query to fill in their low-level details.

Trying to do it all in one request may be inefficient because shard_size may need increasing to retrieve more terms for fully accurate usage stats, and if shard_size is much greater than size, a large number of results (and all their detailed date histograms) will be thrown away.
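
To illustrate, here is a rough sketch of the two-request approach. The field names and placeholder range values are copied from your query above; "name1" … "name5" are stand-ins for whatever "FldName" values the first response actually returns. shard_size is raised only in the first request, where it is cheap because no date histograms are attached.

First request - find the top 5 names by average usage, no per-day detail yet:

{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "range": { "MyDateTime": { "gte": 123, "lte": 456, "format": "epoch_millis" } } }
      ]
    }
  },
  "aggs": {
    "FldNameWise": {
      "terms": {
        "field": "FldName",
        "order": { "FldOrder": "desc" },
        "size": 5,
        "shard_size": 100
      },
      "aggs": {
        "FldOrder": { "avg": { "field": "MyUsage" } }
      }
    }
  }
}

Second request - filter to just those names and fill in the date histograms:

{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "range": { "MyDateTime": { "gte": 123, "lte": 456, "format": "epoch_millis" } } },
        { "terms": { "FldName": ["name1", "name2", "name3", "name4", "name5"] } }
      ]
    }
  },
  "aggs": {
    "FldNameWise": {
      "terms": { "field": "FldName", "size": 5 },
      "aggs": {
        "DateTimeWise": {
          "date_histogram": { "field": "MyDateTime", "interval": "1d" },
          "aggs": {
            "MyUitl": { "avg": { "field": "MyUsage" } },
            "MyUitl2": { "avg": { "field": "MyUsage2" } }
          }
        }
      }
    }
  }
}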

I will explore breaking the search into 2 requests as you suggested, though.

We have only a single node and a single shard.
Moreover, if I select only a particular value of 'FldName', or remove "order": { "FldOrder": "desc" }, then I get data for all day intervals.

That should be OK then. Do you get a non-zero number for the doc_count_error_upper_bound?
What version of elasticsearch are you on?

Yes, I get "doc_count_error_upper_bound": -1. I am using Elasticsearch version 6.7.2.

Out of interest, how many unique "FldName" values do you have? (The cardinality agg can help reveal this).
Also, how many total docs in the index and in the time range you query on?
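
Something like this would report the approximate number of distinct values (reusing your field name; "FldNameCount" is just an arbitrary label for the aggregation):

{
  "size": 0,
  "aggs": {
    "FldNameCount": {
      "cardinality": {
        "field": "FldName"
      }
    }
  }
}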

I have only around 100 unique "FldName" values. I query a date range of at most 30 days,
which spans around 30 indices since I have a daily index. One index contains around 38,400 documents, so across all 30 indices it is roughly 1,152,000.

Ah - that explains my confusion - multiple indices means multiple shards which means the potential for this distributed counting problem (even though your content is not currently physically distributed across multiple nodes).

The reported error bound of -1 means the margin for error cannot be determined:

"When the aggregation is either sorted by a sub aggregation or in order of ascending document count, the error in the document counts cannot be determined and is given a value of -1 to indicate this."

All of my original advice still stands.

That's not a huge number, so you could try setting shard_size to 100 to gather all the stats - your final result is only 5, so we'd throw away 95 date histograms, which is less than ideal. It would be more efficient to do 2 requests, but you might get away with 1 in terms of memory, depending on the length of the date range and the size of the date interval.
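
Concretely, that just means adding shard_size alongside size in your existing terms aggregation (sub-aggregations unchanged):

"FldNameWise": {
  "terms": {
    "field": "FldName",
    "order": { "FldOrder": "desc" },
    "size": 5,
    "shard_size": 100
  },
  ...
}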

Thank you so much. Setting shard_size to 100 works, but I will also explore the other, more efficient approach you described.

