Global ordinals performance and size on-heap

jimczi · November 27, 2018, 11:34am

Loading of global ordinals sometimes seems to run forever, with no circuit breakers triggered (see numbers below). Is that intentional?

The circuit breaker doesn't check how long a query takes, only how much memory it allocates.
Loading global ordinals can be expensive on large shards/indices with high-cardinality fields. The docs are maybe a bit optimistic and could mention workarounds for when loading is slow or the searcher is refreshed often (each refresh of the main searcher invalidates the loaded global ordinals).
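
One possible shape for such a workaround, sketched as an index-creation body: `eager_global_ordinals` moves the loading cost from the first query to refresh time, and a longer `refresh_interval` means the loaded ordinals are invalidated less often. The field name `user_id`, the `30s` interval, and the helper function are assumptions for illustration only, and the typeless `mappings` layout assumes Elasticsearch 7.x or later.

```python
import json

def index_body(field, refresh_interval="30s"):
    """Build an index-creation body (hypothetical helper, not an Elasticsearch API).

    - refresh_interval: a longer interval means fewer searcher refreshes,
      so global ordinals are rebuilt less often.
    - eager_global_ordinals: build global ordinals at refresh time instead of
      on the first aggregation that needs them.
    """
    return {
        "settings": {"index": {"refresh_interval": refresh_interval}},
        "mappings": {
            "properties": {
                field: {"type": "keyword", "eager_global_ordinals": True}
            }
        },
    }

body = index_body("user_id")  # "user_id" is a placeholder field name
print(json.dumps(body, indent=2))
```

Whether this helps depends on the workload: eager loading makes refreshes slower, so it trades indexing-side cost for predictable query latency.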

Fields defined with the IP datatype in the mapping seem to take even longer to load (10-20% slower). Is there a reason for that? Should we use a multi-field (IP datatype for search, keyword for aggs)? Can we avoid the multi-field?

The IP datatype encodes any IP address in a keyword field, so I guess that the decoding makes the loading of global ordinals slower. The ip field was represented as a number internally in 2.x; in 5.x we added support for IPv6 and switched to a keyword representation. This means that the terms aggregation will use the global ordinals execution hint by default for this field. If loading global ordinals is too slow in your use case, you can switch to the map execution, which will be slower and require more memory per query but doesn't require loading a shared resource on heap.
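
As a rough illustration of that switch, the search body below sets `execution_hint` to `map` on a terms aggregation. Only the `execution_hint` parameter comes from the discussion; the field name `client_ip`, the aggregation name `by_term`, and the helper function are hypothetical.

```python
import json

def terms_agg_body(field, size=10, execution_hint="map"):
    """Build a search body with a single terms aggregation.

    The aggregation name "by_term" is a placeholder chosen for this sketch.
    """
    return {
        "size": 0,  # return no hits, only aggregation results
        "aggs": {
            "by_term": {
                "terms": {
                    "field": field,
                    "size": size,
                    # "map" aggregates on term values per query instead of
                    # relying on the shared global ordinals structure.
                    "execution_hint": execution_hint,
                }
            }
        },
    }

body = terms_agg_body("client_ip")  # "client_ip" is a placeholder field name
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `_search` request against the index in question.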

Other aggregations, most notably the date histogram aggregation, run perfectly well and very fast, even on fields with very high cardinality. Is there a way to use the same technique for the terms aggregation as well? In our case approximate counts would be OK too.

Date histograms work with numbers directly, so they don't build global ordinals at all. The terms aggregation should be fast once the global ordinals are built, so there's nothing that we do better in date histograms; it's just that they work on a different type of data. For keyword fields we try to avoid loading a map that contains all unique terms for each query, which is why the default mode is to build the global ordinals. You can opt out of this default by setting the execution_hint to map.

The docs for the map execution hint state (Terms aggregation | Elasticsearch Guide [8.11] | Elastic): "Please note that Elasticsearch will ignore this execution hint if it is not applicable and that there is no backward compatibility guarantee on these hints." Since the map execution hint is (currently) required for such queries to run, can the docs state those "not applicable" scenarios explicitly?

Currently the hint is ignored if you set the execution_hint to global_ordinals and global ordinals are not available for your source (script, field mapped as a number, ...). map should always be honored.

Can global ordinals be monitored somehow? I couldn't find any obvious metric that reports the amount of memory they consume, and in our case, where a significant amount of heap space is consumed, we just attribute it to global ordinals, true or not.

They are monitored in the fielddata stats. We could have a dedicated section, but if you don't have any text field that loads fielddata, then the memory reported can be attributed entirely to global ordinals.
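
A minimal sketch of reading those stats, assuming the response of `GET _nodes/stats/indices/fielddata` has already been fetched and parsed into a dict. The helper function and the sample data (node IDs, names, byte counts) are made up for illustration; real responses carry additional keys.

```python
def fielddata_bytes_by_node(nodes_stats):
    """Return {node name: fielddata memory in bytes} from a parsed nodes stats response."""
    result = {}
    for node_id, node in nodes_stats.get("nodes", {}).items():
        fielddata = node.get("indices", {}).get("fielddata", {})
        # Fall back to the node ID when the response carries no node name.
        result[node.get("name", node_id)] = fielddata.get("memory_size_in_bytes", 0)
    return result

# Hypothetical sample, trimmed to the keys this sketch reads.
sample = {
    "nodes": {
        "node-id-1": {
            "name": "data-1",
            "indices": {"fielddata": {"memory_size_in_bytes": 1048576, "evictions": 0}},
        },
        "node-id-2": {
            "name": "data-2",
            "indices": {"fielddata": {"memory_size_in_bytes": 2097152, "evictions": 0}},
        },
    }
}

per_node = fielddata_bytes_by_node(sample)
total = sum(per_node.values())
print(per_node, total)
```

If no text field loads fielddata, the totals here would correspond to global ordinals, per the reply above. Adding `?fields=*` to the request breaks the numbers down per field.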
