Terms aggregation performance on high-cardinality fields

I'm hoping for some clarification on query performance behavior I observed recently in Elasticsearch 1.5.2.

I have a string field with high cardinality (approximately 200,000,000 unique values).
I observed that if I use a simple terms aggregation with the execution hint global_ordinals_low_cardinality, two things happen:

  1. The query returns the same results as with global_ordinals or global_ordinals_hash.
  2. The query performs significantly faster (about twice as fast as global_ordinals, and four times as fast as global_ordinals_hash).

Here's the query:

{
   "aggs": {
      "myTerms": {
         "terms": {
            "field": "myField",
            "size": 1000,
            "shard_size": 1000,
            "execution_hint": "global_ordinals_low_cardinality"
         }
      }
   }
}

I don't understand why it's even legitimate to use global_ordinals_low_cardinality in this case, since my field has high cardinality. So perhaps I don't understand what exactly global_ordinals_low_cardinality means?

Secondly, I have another numeric field (long) with roughly the same cardinality.
The values of the long field are actually precomputed hash values (murmur3) of the same string field above, which I use to greatly speed up the cardinality aggregation (see the example below).
Running the same terms aggregation on the numeric field performs as badly as global_ordinals_hash.
In fact, no matter which execution hint I use, the execution time stays the same.
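
For reference, the cardinality aggregation over the precomputed hash field looks roughly like this (the field name myFieldHash is illustrative, not from my actual mapping):

{
   "aggs": {
      "myDistinctCount": {
         "cardinality": {
            "field": "myFieldHash"
         }
      }
   }
}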

So why is global_ordinals_low_cardinality applicable to string fields, but not to long fields? Is it because numeric fields do not require global ordinals at all?

Thanks


Hi Roman,

This option is just a hint given to Elasticsearch, and in practice all of these hints only apply to string fields. That's why it doesn't make any difference on your murmur3 field, which is a long. As you guessed, this is because numeric fields don't support ordinals the way string fields do.

global_ordinals_low_cardinality is probably the fastest execution mode, but it also comes with non-negligible memory usage, which is linear in the number of unique values you have. This is why, by default, we only use it on low-cardinality fields.

Thanks for your reply.

I didn't notice abnormal heap usage during execution with global_ordinals_low_cardinality, at least no more than with global_ordinals, but I'll recheck.

Could you perhaps elaborate, then, on how a terms aggregation on a string field can perform significantly better than on its long hash counterpart, considering that long values don't require global ordinals at all?

Thanks in advance

The way Elasticsearch indexes string fields is more efficient for terms aggregations: every term is assigned a unique ordinal, a dense integer between 0 and the number of unique terms. So in order to count occurrences of individual terms, you just need to allocate an array whose length is the number of unique terms; every time a document is collected, you look up the ordinal of its value and increment the counter at that index.

For long values, we can't make assumptions about the range of values, so we have to resort to a hash table, which tends to be a bit slower than a plain array because of collision resolution.
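
A minimal Java sketch of the difference (the names and data here are illustrative, not Elasticsearch's actual implementation):

import java.util.HashMap;
import java.util.Map;

public class TermsCountingSketch {
    public static void main(String[] args) {
        // String fields: every unique term has a dense ordinal in [0, numUniqueTerms),
        // so counting occurrences is a plain array increment.
        int numUniqueTerms = 4;
        int[] docOrdinals = {0, 2, 2, 1, 3, 2, 0};      // one ordinal per collected doc
        long[] countsByOrdinal = new long[numUniqueTerms];
        for (int ordinal : docOrdinals) {
            countsByOrdinal[ordinal]++;                  // O(1) direct array indexing
        }

        // Long fields: the range of values is unknown, so counting falls back
        // to a hash table, paying for hashing and collision resolution.
        long[] docValues = {42L, 7L, 42L, -3L};          // one value per collected doc
        Map<Long, Long> countsByValue = new HashMap<>();
        for (long value : docValues) {
            countsByValue.merge(value, 1L, Long::sum);
        }

        System.out.println(java.util.Arrays.toString(countsByOrdinal)); // [2, 1, 3, 1]
        System.out.println(countsByValue);               // e.g. {42=2, 7=1, -3=1}
    }
}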

Thanks Adrian, that makes perfect sense.

Is there perhaps a way in Elasticsearch 1.x or 2.0 to use bounded numeric values, so that Elasticsearch can use the more efficient array method for aggregations? That way, in appropriate scenarios there would be no need for global ordinals (since the values are numeric) and the aggregation could be expected to perform as fast as string terms?

It could work, but would require our APIs to give information about the range of the values that they store, which is not the case today. In general we try to keep our APIs minimal, and given that computing stats or percentiles is a more common use-case than running a terms aggregation on a numeric field, I don't think we would be keen to add such an API.
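
To make the idea concrete, counting with known bounds would look roughly like this (purely hypothetical; not an API Elasticsearch exposes):

public class BoundedCountingSketch {
    public static void main(String[] args) {
        // Hypothetical: if the min/max of a numeric field were known up front,
        // a terms aggregation could count into a plain array via an offset,
        // just like ordinal-based counting for strings.
        long minValue = 0, maxValue = 9999;      // assumed known bounds
        long[] counts = new long[(int) (maxValue - minValue + 1)];
        long[] docValues = {7, 13, 7, 42};       // illustrative values
        for (long value : docValues) {
            counts[(int) (value - minValue)]++;  // O(1) array increment
        }
        System.out.println(counts[7]);           // prints 2
    }
}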

I understand.

Even though global ordinals' memory consumption is said to be relatively low, it can add up to a considerable amount, especially when dealing with very high-cardinality fields, multiple indices, and a limited number of nodes.

I think there are scenarios where this memory consumption could be reduced. For example, when two or more fields across numerous indices share the same set of values (the values need not be equal within the same document):

{
    "primeNum1": "7",
    "primeNum2": "13"
}

Today, if I'm not mistaken, every shard of every index will build two global ordinal mappings, one per field, which are roughly the same.

Is this something that could be addressed in future versions?