How can I get top five results in date_histogram aggregations?

Fleaa · July 3, 2017, 3:06pm

Hi all,

I have a simple aggregation that gives me the number of docs for each day in the last thirty days (using date_histogram aggregation).
I need the top 5 days as amount of documents.
How can I get it?

Thanks in advance!

    {
      "size": 0,
      "query": {
        "bool": {
          "must": [
            {
              "query_string": {
                "query": "*",
                "analyze_wildcard": true
              }
            },
            {
              "range": {
                "unixtsms": {
                  "gte": 1496527200000,
                  "lte": 1497045599999,
                  "format": "epoch_millis"
                }
              }
            }
          ],
          "must_not": []
        }
      },
      "_source": {
        "excludes": []
      },
      "aggs": {
        "2": {
          "date_histogram": {
            "field": "unixtsms",
            "interval": "1d",
            "time_zone": "Europe/Berlin",
            "min_doc_count": 1
          }
        }
      }
    }

colings86 · July 3, 2017, 5:04pm

You can add a top_hits aggregation as a sub aggregation to your date_histogram, this should give you what you are after

Fleaa · July 4, 2017, 8:22am

Hi colings86,
thanks for your reply.
Unfortunately I cannot get the result that I'm looking for with "top_hits" aggregation.

This is the response I get with date_histogram:

...
"aggregations": {
    "my_date_histo": {
      "buckets": [
        {
          "key_as_string": "2017-06-04T00:00:00.000Z",
          "key": 1496534400000,
          "doc_count": 611
        },
        {
          "key_as_string": "2017-06-05T00:00:00.000Z",
          "key": 1496620800000,
          "doc_count": 116162
        },
        {
          "key_as_string": "2017-06-06T00:00:00.000Z",
          "key": 1496707200000,
          "doc_count": 123610
        },
        {
          "key_as_string": "2017-06-07T00:00:00.000Z",
          "key": 1496793600000,
          "doc_count": 124738
        },
        {
          "key_as_string": "2017-06-08T00:00:00.000Z",
          "key": 1496880000000,
          "doc_count": 84882
        },
        {
          "key_as_string": "2017-06-09T00:00:00.000Z",
          "key": 1496966400000,
          "doc_count": 151998
        }
      ]
    }
  }

I cannot figure out how can I pass "doc_count" value to "top_hits" aggregation.
Can you help me?
Thank you!

colings86 · July 4, 2017, 8:27am

Try this:

{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "*",
            "analyze_wildcard": true
          }
        },
        {
          "range": {
            "unixtsms": {
              "gte": 1496527200000,
              "lte": 1497045599999,
              "format": "epoch_millis"
            }
          }
        }
      ],
      "must_not": []
    }
  },
  "_source": {
    "excludes": []
  },
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "unixtsms",
        "interval": "1d",
        "time_zone": "Europe/Berlin",
        "min_doc_count": 1
      },
      "aggs": {
        "top_five": {
          "top_hits": {
            "size": 5
          }
        }
      }
    }
  }
}

Fleaa · July 4, 2017, 9:10am

It's exactly what I attempted to do and this is what I got

{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 100,
    "successful": 100,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "2": {
      "buckets": []
    }
  }
}

Fleaa · July 4, 2017, 9:34am

OK, I've written it again (probably there's something different in the code that you pasted for me but I cannot find the mistake) and the result is:

{
  "took": 54,
  "timed_out": false,
  "_shards": {
    "total": 100,
    "successful": 100,
    "failed": 0
  },
  "hits": {
    "total": 602001,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "2": {
      "buckets": [
        { 
          "key_as_string": "2017-06-04T00:00:00.000+02:00",
          "key": 1496527200000,
          "doc_count": 470,
          "top_five": {
            "hits": {
              "total": 470,
              "max_score": 2,
              "hits": [
                  {FIRST DOCUMENT},
                  {SECOND DOCUMENT}
              ]
            }
          }
        },
        {
          "key_as_string": "2017-06-05T00:00:00.000+02:00",
          "key": 1496613600000,
          "doc_count": 116021,
          "top_five": {
            "hits": {
              "total": 116021,
              "max_score": 2,
              "hits": [
                  {FIRST DOCUMENT},
                  {SECOND DOCUMENT}
              ]
            }
          }
        },
          "key_as_string": "2017-06-06T00:00:00.000+02:00",
          "key": 1496700000000,
          "doc_count": 123892,
          "top_five": {
            "hits": {
              "total": 123892,
              "max_score": 2,
              "hits": [
                  {FIRST DOCUMENT},
                  {SECOND DOCUMENT}
              ]
            }
          }
        },
............
CUT

I made the top 2 for brevity but it gives me the top two documents for every day bucket

colings86 · July 4, 2017, 9:50am

I made the top 2 for brevity but it gives me the top two documents for every day bucket

I had understood that this is what you are after?

Re-reading your first post I now think maybe what you are after is the buckets of the date histogram sorted by doc_count and limited to the 5 daily buckets that contain the highest doc count. Is that correct?

Fleaa · July 4, 2017, 9:52am

Using max_bucket aggregation, I got the MAX bucket as number of docs. Unfortunately max_bucket aggregation hasn't "size" parameter so I cannot get the Top N buckets

{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "*",
            "analyze_wildcard": true
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": 1496527200000,
              "lte": 1497045599999,
              "format": "epoch_millis"
            }
          }
        }
      ],
      "must_not": []
    }
  },
  "_source": {
    "excludes": []
  },
  "aggs": {
    "1": {
      "max_bucket": {
        "buckets_path": "1-bucket>_count"
      }
    },
    "1-bucket": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "1d",
        "time_zone": "Europe/Berlin",
        "min_doc_count": 1
      }
    }
  }
}

The result is:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 30,
    "successful": 30,
    "failed": 0
  },
  "hits": {
    "total": 602001,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "1": {
      "value": 152092,
      "keys": [
        "2017-06-09T00:00:00.000+02:00"
      ]
    },
    "1-bucket": {
      "buckets": [
        {
          "key_as_string": "2017-06-04T00:00:00.000+02:00",
          "key": 1496527200000,
          "doc_count": 470
        },
        {
          "key_as_string": "2017-06-05T00:00:00.000+02:00",
          "key": 1496613600000,
          "doc_count": 116021
        },
        {
          "key_as_string": "2017-06-06T00:00:00.000+02:00",
          "key": 1496700000000,
          "doc_count": 123892
        },
        {
          "key_as_string": "2017-06-07T00:00:00.000+02:00",
          "key": 1496786400000,
          "doc_count": 124456
        },
        {
          "key_as_string": "2017-06-08T00:00:00.000+02:00",
          "key": 1496872800000,
          "doc_count": 85070
        },
        {
          "key_as_string": "2017-06-09T00:00:00.000+02:00",
          "key": 1496959200000,
          "doc_count": 152092
        }
      ]
    }
  },
  "status": 200
}

How can I get the Top 5 instead of just the MAX? Any Idea?

Fleaa · July 4, 2017, 9:54am

I had understood that this is what you are after?

Re-reading your first post I now think maybe what you are after is the buckets of the date histogram sorted by doc_count and limited to the 5 daily buckets that contain the highest doc count. Is that correct?

Correct!
Sorry for my poor english

colings86 · July 4, 2017, 10:00am

No need to apologise, it was my fault as I misread what your intentions were. Unfortunately you can't currently do this with the histogram aggregation. We have the following issue open to create a pipeline aggregation to allow you to sort and truncate the output of other aggregations though: https://github.com/elastic/elasticsearch/issues/14928

For now you would have to do this client side but in your case it looks like you are using Kibana for the client side? One other thing you could do is to use a terms aggregation instead and use the script parameter to in the terms aggregation to get the date from the unixtsms field, and output a string in the format yyyy-MM-dd which the terms aggregation can then use for the key of the buckets.

Fleaa · July 4, 2017, 10:24am

I'm almost happy that there is an open issue: I was starting to think that I was stupid
Thank you very much for the patience and the suggestions!

system · August 1, 2017, 10:24am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Aggregation size 0 for top results Elasticsearch	14	12677	July 12, 2018
Check the top_hits aggregation results to see if every top hit has a field with a specific value Elasticsearch	6	1267	October 8, 2019
Get top n with date histogram Elasticsearch	4	685	April 10, 2020
Get Current Top 5 Over Time with Date Histogram Elasticsearch	1	317	November 19, 2021
Elasticsearch Aggregation - Return a list of array from documents found Elasticsearch	4	1610	January 6, 2021

How can I get top five results in date_histogram aggregations?

Related topics