Terms aggregation very slow - but very fast with murmur3

I do a nested aggregation like this:

{
  "aggregations": {
    "query": {
      "terms": {
        "field": "query",
        "size": 1000
      },
      "aggs": {
        "url": {
          "terms": {
            "field": "url.keyword",
            "size": 10
          }
        }
      }
    }
  }
}

Both query and url.keyword are mapped as keyword. The request takes 6 seconds with ES 5, but only 30 ms if I use the murmur3 plugin. Hence my question: why doesn't the Elasticsearch terms aggregation use a hash for the "group by" and then do a lookup to get the real field value?

We don't want to take responsibility for any hash collision inaccuracies. That can be a decision you take in your app code: for example, you could use the breadth_first collect mode on your hashed term and then embed a full version of the term as a child agg.
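A minimal sketch of that suggestion, not a tested recipe. It assumes the murmur3 hash is exposed as a multi-field called `query.hash` (a hypothetical name, use whatever your mapping defines) and uses a `top_hits` child agg with `size: 1` to carry the original query string alongside each hash bucket. The agg names `query_hash` and `query_text` are also made up for illustration.

```json
{
  "size": 0,
  "aggs": {
    "query_hash": {
      "terms": {
        "field": "query.hash",
        "size": 1000,
        "collect_mode": "breadth_first"
      },
      "aggs": {
        "query_text": {
          "top_hits": {
            "size": 1,
            "_source": { "includes": ["query"] }
          }
        },
        "url": {
          "terms": {
            "field": "url.keyword",
            "size": 10
          }
        }
      }
    }
  }
}
```

With breadth_first, the child aggs should only be computed for the buckets that survive the top-level pruning. If two different queries collide on the same hash they end up in one bucket, and that is the inaccuracy your app code has to accept.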

Oh, I see.

top_hits is working fine! I get results in 3 seconds, which is very slow compared to 300 ms without it. So far I get exactly the same results, and murmur3 is very fast.

So the murmur3 team should implement some kind of "terms_hash" aggregation :wink:

There is already an approximate aggregation, cardinality, where we can configure the precision. Why don't you do the same with terms?
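For reference, the precision knob mentioned here is the `precision_threshold` parameter of the cardinality aggregation. A small sketch, where the agg name `distinct_queries` and the value 3000 are just illustrative choices:

```json
{
  "size": 0,
  "aggs": {
    "distinct_queries": {
      "cardinality": {
        "field": "query",
        "precision_threshold": 3000
      }
    }
  }
}
```

Counts below the threshold are expected to be close to exact, and higher thresholds trade memory for accuracy.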

Where possible, we prefer to count internally using ordinals instead of strings. Ordinals are numbers used as placeholders for unique strings.
Normally, ordinals are used if the terms agg is the root of the aggs tree, but for more deeply nested sections of the tree we flip to a different policy that relies on maps of strings.

[WARNING - EXPERT LEVEL FEATURE THAT MAY WREAK HAVOC......]
In your dev environment, you can try setting an execution_hint [1] to influence the strategy used for collecting buckets. global_ordinals will likely be hugely expensive, since it allocates a large array per parent agg bucket, and the health of your nodes may be at risk. However, global_ordinals_hash may be an option worth looking at. Take care with these settings and try them first in a non-production environment only.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-execution-hint
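As a sketch only, this is what the hint could look like on the original request. Putting it on the nested url agg is an assumption, based on the remark that nested sections are the ones that fall back to maps of strings. The value `global_ordinals_hash` is the one named above; other releases may not accept it, so check the execution hint docs [1] for your version before trying it.

```json
{
  "size": 0,
  "aggs": {
    "query": {
      "terms": {
        "field": "query",
        "size": 1000
      },
      "aggs": {
        "url": {
          "terms": {
            "field": "url.keyword",
            "size": 10,
            "execution_hint": "global_ordinals_hash"
          }
        }
      }
    }
  }
}
```

Compare the timing and the node heap usage against the unhinted request on a dev cluster before drawing any conclusions.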
