Top N documents with multiple buckets

Pieter_Agenbag · August 31, 2015, 1:20pm

Hi.

I'm trying to get the top N documents for a specific aggregation , but with multiple term buckets - or any other way to get the results.

Basically , I have a documents with from.ip,from.hostname , to.ip,to.hostname, from.bytes,to.bytes, total.bytes (amongst other) fields.

I would like to just get the top 10 documents in terms of sum total.bytes.

In sql terms...

select from.ip,from.hostname,to.ip,to.hostname,sum(total.bytes) from myTable group by from.ip,from.hostname,to.ip,to.hostname

I have the following query request, but it returns the top 10 for Each bucket , which obviously results in waaay too many results.
P.s. I know my query is wrong in that it sorts on _count and not sum(total.bytes) ...

{
    "query": {
        "match_all": {}
    },
    "size": 10,
    "aggs": {
        "group": {
            "terms": {
                "field": "from.ip.raw",
                "size": 10,
                "order": {
                    "_count": "desc"
                }
            },
            "aggs": {
                "group": {
                    "terms": {
                        "field": "from.hostname.raw",
                        "size": 10,
                        "order": {
                            "_count": "desc"
                        }
                    },
                    "aggs": {
                        "group": {
                            "terms": {
                                "field": "to.ip.raw",
                                "size": 10,
                                "order": {
                                    "_count": "desc"
                                }
                            },
                            "aggs": {
                                "group": {
                                    "terms": {
                                        "field": "to.hostname.raw",
                                        "size": 10,
                                        "order": {
                                            "aggval": "desc"
                                        }
                                    },
                                    "aggs": {
                                        "aggval": {
                                            "sum": {
                                                "field": "total.bytes"
                                            }
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

Pieter_Agenbag · September 1, 2015, 6:45am

Did some more searching and seems a lot of people (probably people coming from relational databases) , are asking this or similar questions.

I seem to have solved my problem using a single (script) terms bucket , combining the four "group by" fields into one... I'll then just have to split the field values out again when processing the result.

I hope this is the right way to do it...or at least an acceptable way

{
    "query": {
        "match_all": {}
    },
    "size": 10,
    "aggs": {
        "flow": {
            "terms": {
                "script": "doc['from.ip'].value + ',' + doc['from.hostname'].value + ',' + doc['to.ip'].value + ',' + doc['to.hostname'].value",
                "size": 10,
                "order": {
                    "total": "desc"
                }
            },
			"aggs": {
				"total": {
					"sum": {
						"field": "total.bytes"
					}
				}
				,"from": {
					"sum": {
						"field": "from.bytes"
					}
				}
				,"to": {
					"sum": {
						"field": "to.bytes"
					}
				}
			}
		}
	}
}

mvg · September 1, 2015, 11:47am

I think in this case the top_hits aggregation is something you would use in this case. It allows you to return the top matching documents per bucket (depending on the level you define it).

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html?q=top_hits#search-aggregations-metrics-top-hits-aggregation

samthurston · September 16, 2015, 3:35pm

top_hits doesn't allow for sub-aggregation so getting stats on the results seems problematic

mvg · September 17, 2015, 1:14pm

On what kind of property do you like to aggregate under the top_hits agg?

top_hits is a metric agg and the idea is that it should be a leaf agg (just
like other metric aggs), If you like to aggregate deeper then you'll need
to add an additional bucket agg on the same level that the top_hits agg is
and then place another top_hits agg under that new bucket agg.

Topic		Replies	Views
Getting top_buckets(size=5) by using max(bucket.doc._score) + also getting the top_document() on each bucket Elasticsearch	1	366	July 5, 2017
How can I limit bucket to n results for averaging? Elasticsearch	6	1755	July 5, 2017
Top N documents from top_hits, rather than top N per bucket Elasticsearch	1	885	July 5, 2017
Terms aggs and then sort all data Elasticsearch	4	606	January 31, 2019
Terms agregation Elasticsearch	1	316	May 15, 2020

Top N documents with multiple buckets

Related topics