Loading of global ordinals sometimes seems to run forever, with no circuit breakers triggered (see numbers below). Is that intentional?
The circuit breaker doesn't check how long a query takes, only how much memory it allocates.
Loading global ordinals can be expensive on large shards/indices with high-cardinality fields. The docs are perhaps a bit optimistic and could mention workarounds for when loading is slow or the searcher is refreshed often (each refresh of the main searcher invalidates the loaded global ordinals).
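As a rough sketch of the kind of workaround meant here, the two request bodies below use settings that the Elasticsearch docs describe for this situation: a longer `index.refresh_interval` (fewer refreshes, so fewer invalidations) and `eager_global_ordinals` on the field (the ordinals are rebuilt at refresh time rather than by the first query). Neither is named in this thread, and the field name `client_ip_kw` and the `30s` value are placeholders, so treat this as an assumption to validate against your own workload.

```python
import json


def refresh_interval_settings(interval: str) -> dict:
    """Body for PUT /<index>/_settings that spaces out refreshes."""
    return {"index": {"refresh_interval": interval}}


def eager_ordinals_mapping(field: str) -> dict:
    """Body for PUT /<index>/_mapping that builds global ordinals at refresh time."""
    return {
        "properties": {
            field: {
                "type": "keyword",
                # Moves the loading cost from the first query to the refresh.
                "eager_global_ordinals": True,
            }
        }
    }


if __name__ == "__main__":
    # "client_ip_kw" and "30s" are hypothetical values for illustration only.
    print(json.dumps(refresh_interval_settings("30s"), indent=2))
    print(json.dumps(eager_ordinals_mapping("client_ip_kw"), indent=2))
```

Eager loading does not make the build cheaper; it only shifts the cost to refresh time, so it tends to pair with a longer refresh interval.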
- It seems that fields defined with the IP datatype in the mapping take even longer to load (10-20% slower). Is there a reason for that? Should we use a multi-field (IP datatype for search, keyword for aggs)? Can we avoid the multi-field?
The IP datatype encodes every IP address as a `keyword` field, so I guess that the decoding makes the loading of global ordinals slower. In 2.x the `ip` field was represented internally as a number; in 5.x we added support for IPv6 and switched to a `keyword` representation. This means that the `terms` aggregation uses the global ordinals execution hint by default for this field. If loading global ordinals is too slow for your use case, you can switch to the `map` execution hint, which is slower and requires more memory per query but doesn't need to load a shared resource onto the heap.
- Other aggregations, most notably the date histogram aggregation, run perfectly well and very fast even on fields with very high cardinality. Is there a way to use the same technique for the terms aggregation as well? In our case approximate counts would be OK too.
Date histograms work with numbers directly, so they don't build global ordinals at all. The `terms` aggregation should be fast once the global ordinals are built, so there's nothing that we do better in date histograms; they just work on a different type of data. For `keyword` fields we try to avoid loading a map that contains all unique terms for each query, which is why the default mode is to build global ordinals. You can opt out of this default by setting the `execution_hint` to `map`.
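For illustration, here is a minimal sketch of a search body that sets the hint on a `terms` aggregation. The field name `client_ip`, the aggregation name `by_term`, and the `size` value are placeholders rather than anything from this thread; the body would be sent to `POST /<index>/_search`.

```python
import json


def terms_agg_body(field: str, size: int = 10, execution_hint: str = "map") -> dict:
    """Build a search body with a terms aggregation that sets execution_hint."""
    return {
        "size": 0,  # return no hits, only the aggregation result
        "aggs": {
            "by_term": {  # hypothetical aggregation name
                "terms": {
                    "field": field,
                    "size": size,
                    # "map" aggregates on the values directly instead of
                    # building global ordinals first.
                    "execution_hint": execution_hint,
                }
            }
        },
    }


if __name__ == "__main__":
    # "client_ip" is a hypothetical field name.
    print(json.dumps(terms_agg_body("client_ip"), indent=2))
```

Because the hint is set per request, you can compare both modes on the same index by switching `execution_hint` between `map` and `global_ordinals`.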
- The docs for the `map` execution hint state (Terms aggregation | Elasticsearch Guide [8.11] | Elastic): "Please note that Elasticsearch will ignore this execution hint if it is not applicable and that there is no backward compatibility guarantee on these hints." Since the map execution hint is (currently) required for such queries to run, can the docs state those "not applicable" scenarios explicitly?
Currently the hint is ignored if you set the `execution_hint` to `global_ordinals` and global ordinals are not available for your source (a script, a field mapped as a number, ...). `map` should always be honored.
- Can global ordinals be monitored somehow? I couldn't find any obvious metric that returns the amount of memory they consume, and in our case, where a significant amount of heap space is consumed, we just attribute it to global ordinals, rightly or not.
They are monitored in the `fielddata` stats. We could have a dedicated section, but if you don't have any `text` field that loads `fielddata`, then the memory reported can be attributed entirely to global ordinals.
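As a rough sketch of reading those stats, the helper below pulls per-field byte counts out of a response from `GET /_stats/fielddata?fields=*`. The response layout it expects (`_all.total.fielddata.fields.<name>.memory_size_in_bytes`) is my understanding of that API and should be checked against your version; the `SAMPLE` payload and its field names are made up purely to show the shape.

```python
def fielddata_bytes_by_field(stats: dict) -> dict:
    """Map field name -> fielddata bytes from an index stats response.

    Assumes the layout _all.total.fielddata.fields.<name>.memory_size_in_bytes;
    missing keys yield an empty result instead of raising.
    """
    fielddata = stats.get("_all", {}).get("total", {}).get("fielddata", {})
    fields = fielddata.get("fields", {})
    return {
        name: info.get("memory_size_in_bytes", 0)
        for name, info in fields.items()
    }


# Invented sample payload, for illustration only (not real measurements).
SAMPLE = {
    "_all": {
        "total": {
            "fielddata": {
                "memory_size_in_bytes": 3072,
                "fields": {
                    "client_ip": {"memory_size_in_bytes": 2048},
                    "user_id": {"memory_size_in_bytes": 1024},
                },
            }
        }
    }
}

if __name__ == "__main__":
    for name, size in fielddata_bytes_by_field(SAMPLE).items():
        print(f"{name}: {size} bytes")
```

If none of the listed fields is a `text` field with fielddata enabled, the per-field numbers can be read as global ordinals memory, as described above.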