How can I get top five results in date_histogram aggregations?

Hi all,

I have a simple aggregation that gives me the number of docs for each day in the last thirty days (using date_histogram aggregation).
I need the top 5 days as amount of documents.
How can I get it?

Thanks in advance!

    {
      "size": 0,
      "query": {
        "bool": {
          "must": [
            {
              "query_string": {
                "query": "*",
                "analyze_wildcard": true
              }
            },
            {
              "range": {
                "unixtsms": {
                  "gte": 1496527200000,
                  "lte": 1497045599999,
                  "format": "epoch_millis"
                }
              }
            }
          ],
          "must_not": []
        }
      },
      "_source": {
        "excludes": []
      },
      "aggs": {
        "2": {
          "date_histogram": {
            "field": "unixtsms",
            "interval": "1d",
            "time_zone": "Europe/Berlin",
            "min_doc_count": 1
          }
        }
      }
    }

You can add a top_hits aggregation as a sub aggregation to your date_histogram, this should give you what you are after

Hi colings86,
thanks for your reply.
Unfortunately I cannot get the result that I'm looking for with "top_hits" aggregation.

This is the response I get with date_histogram:

...
"aggregations": {
    "my_date_histo": {
      "buckets": [
        {
          "key_as_string": "2017-06-04T00:00:00.000Z",
          "key": 1496534400000,
          "doc_count": 611
        },
        {
          "key_as_string": "2017-06-05T00:00:00.000Z",
          "key": 1496620800000,
          "doc_count": 116162
        },
        {
          "key_as_string": "2017-06-06T00:00:00.000Z",
          "key": 1496707200000,
          "doc_count": 123610
        },
        {
          "key_as_string": "2017-06-07T00:00:00.000Z",
          "key": 1496793600000,
          "doc_count": 124738
        },
        {
          "key_as_string": "2017-06-08T00:00:00.000Z",
          "key": 1496880000000,
          "doc_count": 84882
        },
        {
          "key_as_string": "2017-06-09T00:00:00.000Z",
          "key": 1496966400000,
          "doc_count": 151998
        }
      ]
    }
  }

I cannot figure out how can I pass "doc_count" value to "top_hits" aggregation.
Can you help me?
Thank you!

Try this:

{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "*",
            "analyze_wildcard": true
          }
        },
        {
          "range": {
            "unixtsms": {
              "gte": 1496527200000,
              "lte": 1497045599999,
              "format": "epoch_millis"
            }
          }
        }
      ],
      "must_not": []
    }
  },
  "_source": {
    "excludes": []
  },
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "unixtsms",
        "interval": "1d",
        "time_zone": "Europe/Berlin",
        "min_doc_count": 1
      },
      "aggs": {
        "top_five": {
          "top_hits": {
            "size": 5
          }
        }
      }
    }
  }
}

It's exactly what I attempted to do and this is what I got :confused:

{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 100,
    "successful": 100,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "2": {
      "buckets": []
    }
  }
}

OK, I've written it again (probably there's something different in the code that you pasted for me but I cannot find the mistake) and the result is:

{
  "took": 54,
  "timed_out": false,
  "_shards": {
    "total": 100,
    "successful": 100,
    "failed": 0
  },
  "hits": {
    "total": 602001,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "2": {
      "buckets": [
        { 
          "key_as_string": "2017-06-04T00:00:00.000+02:00",
          "key": 1496527200000,
          "doc_count": 470,
          "top_five": {
            "hits": {
              "total": 470,
              "max_score": 2,
              "hits": [
                  {FIRST DOCUMENT},
                  {SECOND DOCUMENT}
              ]
            }
          }
        },
        {
          "key_as_string": "2017-06-05T00:00:00.000+02:00",
          "key": 1496613600000,
          "doc_count": 116021,
          "top_five": {
            "hits": {
              "total": 116021,
              "max_score": 2,
              "hits": [
                  {FIRST DOCUMENT},
                  {SECOND DOCUMENT}
              ]
            }
          }
        },
          "key_as_string": "2017-06-06T00:00:00.000+02:00",
          "key": 1496700000000,
          "doc_count": 123892,
          "top_five": {
            "hits": {
              "total": 123892,
              "max_score": 2,
              "hits": [
                  {FIRST DOCUMENT},
                  {SECOND DOCUMENT}
              ]
            }
          }
        },
............
CUT

I made the top 2 for brevity but it gives me the top two documents for every day bucket :confused:

I made the top 2 for brevity but it gives me the top two documents for every day bucket :confused:

I had understood that this is what you are after?

Re-reading your first post I now think maybe what you are after is the buckets of the date histogram sorted by doc_count and limited to the 5 daily buckets that contain the highest doc count. Is that correct?

Using max_bucket aggregation, I got the MAX bucket as number of docs. Unfortunately max_bucket aggregation hasn't "size" parameter so I cannot get the Top N buckets :expressionless:

{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "*",
            "analyze_wildcard": true
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": 1496527200000,
              "lte": 1497045599999,
              "format": "epoch_millis"
            }
          }
        }
      ],
      "must_not": []
    }
  },
  "_source": {
    "excludes": []
  },
  "aggs": {
    "1": {
      "max_bucket": {
        "buckets_path": "1-bucket>_count"
      }
    },
    "1-bucket": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "1d",
        "time_zone": "Europe/Berlin",
        "min_doc_count": 1
      }
    }
  }
}

The result is:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 30,
    "successful": 30,
    "failed": 0
  },
  "hits": {
    "total": 602001,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "1": {
      "value": 152092,
      "keys": [
        "2017-06-09T00:00:00.000+02:00"
      ]
    },
    "1-bucket": {
      "buckets": [
        {
          "key_as_string": "2017-06-04T00:00:00.000+02:00",
          "key": 1496527200000,
          "doc_count": 470
        },
        {
          "key_as_string": "2017-06-05T00:00:00.000+02:00",
          "key": 1496613600000,
          "doc_count": 116021
        },
        {
          "key_as_string": "2017-06-06T00:00:00.000+02:00",
          "key": 1496700000000,
          "doc_count": 123892
        },
        {
          "key_as_string": "2017-06-07T00:00:00.000+02:00",
          "key": 1496786400000,
          "doc_count": 124456
        },
        {
          "key_as_string": "2017-06-08T00:00:00.000+02:00",
          "key": 1496872800000,
          "doc_count": 85070
        },
        {
          "key_as_string": "2017-06-09T00:00:00.000+02:00",
          "key": 1496959200000,
          "doc_count": 152092
        }
      ]
    }
  },
  "status": 200
}

How can I get the Top 5 instead of just the MAX? Any Idea?

I had understood that this is what you are after?

Re-reading your first post I now think maybe what you are after is the buckets of the date histogram sorted by doc_count and limited to the 5 daily buckets that contain the highest doc count. Is that correct?

Correct! :slight_smile:
Sorry for my poor english :blush:

No need to apologise, it was my fault as I misread what your intentions were. Unfortunately you can't currently do this with the histogram aggregation. We have the following issue open to create a pipeline aggregation to allow you to sort and truncate the output of other aggregations though: https://github.com/elastic/elasticsearch/issues/14928

For now you would have to do this client side but in your case it looks like you are using Kibana for the client side? One other thing you could do is to use a terms aggregation instead and use the script parameter to in the terms aggregation to get the date from the unixtsms field, and output a string in the format yyyy-MM-dd which the terms aggregation can then use for the key of the buckets.

I'm almost happy that there is an open issue: I was starting to think that I was stupid :slight_smile:
Thank you very much for the patience and the suggestions!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.