Elasticsearch terms aggregations performance of many unique values

I use terms aggregations to aggregate my data . But there is a problem.
For example , If i have 300 million data in my ES cluster. I use terms aggregation to aggregate the bytes of ip. If the ip field has 1000 unique ip, its perfomance is very good , but if I have many unique values in one filed, such as 1 million unique ip, the aggregations is very slow.
So, is there any solutions of it to optimize the performance

If your query is very selective (matches few IPs) then it's worth looking at setting the execution_hint to "map".

I need to aggregate top 10 srcip_port_dstip_port by flow.bytes in 300 million data.
Here is my term aggregations.

"aggs": {
    "data": {
      "terms": {
        "field": "src_dst_ip_port.keyword",
        "size": 10,
        "collect_mode": "breadth_first",
        "execution_hint": "map"
      },
      "aggs": {
        "sum_bits": {
          "sum": {
            "field": "flow.bytes"
          }
        }
      }
    }
  }

I think "map" is just suitable for small amount of data which is not work well in my search.

The query phrase may only to select time range, which may return 330 million data(1.1 T).
And I write data to my index constantly. I have used time-based indices. such as xxxx-2020.06.12
I have read a lot of articles about tuning the performance about high-cardinality terms aggregations. Someone recommanded to improve the refresh_interval and set eager-global-ordinals. I wonder if it work. Or some othrt solutions?

I expect the breadth-first mode is not helping here. To explain why, it's useful to understand why it was added as an alternative to the depth-first default.
Normally each document travels depth first, top to bottom, through the agg tree (like a ball dropping into a bean machine). Unlike a bean machine, new pins are added in the form of buckets as the terms are unpacked from the document. This mode works well for most cases (likely yours too). Where it broke down was when we looked at "movie" documents from IMDB and wanted to get a list of the top ten actors and who they acted with. In this case each document has an array of N actors, producing N buckets, and the next stage in the tree which finds the co-stars adds another N child buckets to each parent. So each document produces N squared buckets which is a massive explosion even from small datasets.

The breadth-first solution overcomes these massive combinatorial explosions by finding the top ten actors first (Tom Hanks etc) and trims away all the less-prolific actors. The next stage for only these ten surviving actors is to play back the set of documents associated with them to all the child aggregations to expand their list of co-stars. This is more efficient but can still be costly - the set of Lucene document ids for every actor is buffered in memory in case they survive the initial breadth-first round of pruning and need to be flushed into the tree. This buffering is probably not too bad for IMDB data (just how many movies can an actor star in?) but in the case of IP addresses they could generate many millions of documents per term. So in your case it's probably better to use depth-first and let all of those balls trickle down the full extent of the agg tree rather than accumulate in RAM at the top levels of the tree. You only have a single average agg associated with each term which is probably cheaper than keeping a stack of all document IDs per term.

Cause I have several layers terms aggregations. It is not suitable for me. I have test these two collect mode and breadth-first mode worked better.

I've seen this example before in the elasticsearch's official documentation. The data source is sflow data. It's true that each document's field will contain one value instead of a list of values. So I want to know that how "terms with order by flow.bytes(not count)" works in the two collect mode?

I have set "eager-global-ordinals" true in my template which means that the global ordinals will calculate after the elasticsearch refresh instad of the first search. It works. Next step I will try to increase the refresh interval. Thanks a lot.

If you order by a child aggregation then the breadth-first parameter is ignored for that child aggregation.
Glad to hear you got things working.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.