Terms aggregation very slow - but very fast with murmur3


(Thomas Decaux) #1

I do a nested aggregation like this:

"aggregations": {
  "query": {
    "terms": {
      "field": "query",
      "size": 1000
    },
    "aggs": {
      "url": {
        "terms": {
          "field": "url.keyword",
          "size": 10
        }
      }
    }
  }
}

Both query and url.keyword are stored as keyword. The query takes 6 seconds with ES 5, but 30 ms if I use the murmur3 plugin. Hence my question: why doesn't the Elasticsearch terms aggregation use a hash for the "group by", then do a lookup to get the real field value?


(Mark Harwood) #2

We don't want to take responsibility for any hash collision inaccuracies. That can be a decision you take in your app code: for example, you could use the breadth_first collect mode on your hashed term, then embed the full version of the term as a child agg.
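
A minimal sketch of what such a request body could look like, built as a Python dict. The field name "query.hash" is an assumption (a hypothetical murmur3 sub-field of "query"), as are the agg names; collect_mode and its breadth_first value are the standard terms aggregation options. The child terms agg on the full field asks for two buckets, so a hash collision would show up as a second term under the same hash bucket.

```python
import json

# Assumption: "query.hash" is a hypothetical murmur3 sub-field mapped on "query".
HASHED_FIELD = "query.hash"
FULL_FIELD = "query"


def hashed_terms_request(size=1000, full_terms_per_hash=2):
    """Group by the hashed field, then recover the full term in a child agg."""
    return {
        "size": 0,  # only aggregation results are needed, no hits
        "aggs": {
            "query_hash": {
                "terms": {
                    "field": HASHED_FIELD,
                    "size": size,
                    # Prune to the top buckets before running the child aggs.
                    "collect_mode": "breadth_first",
                },
                "aggs": {
                    # Full version of the term; more than one bucket here
                    # would indicate a hash collision.
                    "full_query": {
                        "terms": {"field": FULL_FIELD, "size": full_terms_per_hash}
                    },
                    "url": {"terms": {"field": "url.keyword", "size": 10}},
                },
            }
        },
    }


body = hashed_terms_request()
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a _search request; whether the full-term child agg is cheap enough depends on how many top buckets survive the pruning.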


(Thomas Decaux) #3

Oh, I see.

top_hits is working fine! I got results in 3 seconds, which is very slow compared to 300 ms without it. For now, I get exactly the same results, and murmur3 is very fast.

So, the murmur3 team should implement a kind of "terms_hash" aggregation :wink:

There is already an approximate aggregation, cardinality, where we can configure the precision. Why don't you do the same with terms?


(Mark Harwood) #4

Where possible, internally we prefer to count using ordinals instead of strings. Ordinals are numbers used as a placeholder for a unique string.
Normally ordinals are used if the terms agg is the root of the aggs tree, but we flip to a different policy, relying on maps of strings, for more deeply nested sections of the agg tree.
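
To illustrate the difference, here is a toy sketch in plain Python (not Elasticsearch internals; the function names and sample data are made up for illustration). Counting by ordinal increments a slot in a dense array and resolves numbers back to strings only at the end, while the map-based approach hashes and compares the string itself for every document.

```python
def count_with_ordinals(doc_terms):
    """Count terms via ordinals: each unique string gets a small integer."""
    terms = sorted(set(doc_terms))                    # ordinal -> string
    ordinal_of = {t: i for i, t in enumerate(terms)}  # string -> ordinal
    # What a per-document column of ordinals would hold.
    doc_ordinals = [ordinal_of[t] for t in doc_terms]

    counts = [0] * len(terms)      # dense array, one slot per ordinal
    for ordinal in doc_ordinals:
        counts[ordinal] += 1       # integer index, no string hashing here

    # Resolve ordinals back to strings only when building the final result.
    return {terms[i]: c for i, c in enumerate(counts)}


def count_with_string_map(doc_terms):
    """Count terms via a map keyed directly by the string value."""
    counts = {}
    for term in doc_terms:
        counts[term] = counts.get(term, 0) + 1  # hashes the string each time
    return counts


docs = ["a.com", "b.com", "a.com", "c.com", "a.com"]
assert count_with_ordinals(docs) == count_with_string_map(docs)
print(count_with_ordinals(docs))  # {'a.com': 3, 'b.com': 1, 'c.com': 1}
```

The dense array is also why the ordinal approach gets expensive under a parent agg: one array of that size would be needed per parent bucket.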

[WARNING - EXPERT LEVEL FEATURE THAT MAY WREAK HAVOC......]
In your dev environment, you can set execution_hint [1] to influence the strategy used for collecting buckets. The global_ordinals option will likely be hugely expensive, allocating a large array per parent agg bucket, and could put the health of your nodes at risk. However, global_ordinals_hash may be an option to look at. Take care with these settings and try them first only in a non-production environment.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-execution-hint
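
As a rough sketch, the hint could be added to the nested url agg from the original request like this. The helper name is hypothetical, and the value global_ordinals_hash is simply the one named above; check the page at [1] for the values your Elasticsearch version accepts.

```python
import json


def nested_terms_request(execution_hint=None):
    """Build the original nested terms request, optionally with a hint."""
    url_terms = {"field": "url.keyword", "size": 10}
    if execution_hint is not None:
        # Expert setting: try it in a non-production environment first.
        url_terms["execution_hint"] = execution_hint
    return {
        "size": 0,  # aggregation results only
        "aggs": {
            "query": {
                "terms": {"field": "query", "size": 1000},
                "aggs": {"url": {"terms": url_terms}},
            }
        },
    }


hinted = nested_terms_request("global_ordinals_hash")
print(json.dumps(hinted, indent=2))
```

Running the same request with and without the hint against a dev cluster, and comparing the "took" value in each response, is one simple way to see whether the hint helps for your data.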

