Global ordinals performance and size on-heap


(Itamar Syn-Hershko) #1

Hi Elastic folks,

We are seeing significant performance hit stemming from the usage of global ordinals (CPU, heap, GC). All background on our specific situation is below, points in question are:

  1. Loading of global ordinals seem to sometimes run forever, with no circuit breakers triggered (see numbers below). Is that intentional?
  2. The docs state "The loading time of global ordinals depends on the number of terms in a field, but in general it is low, since it source field data has already been loaded. The memory overhead of global ordinals is a small because it is very efficiently compressed." (see https://www.elastic.co/guide/en/elasticsearch/reference/current/eager-global-ordinals.html). This doesn't seem to be the case in our use-case - loading times are high and significant memory is consumed. Are the docs wrong, or is this a corner case worth checking in the ES codebase?
  3. Seems like fields defined with IP datatype in the mapping take even longer to load (10-20% slower) - is there a reason for that? should we use multi-field (IP datatype for search, keyword for aggs)? Can we avoid multi-field?
  4. Other aggregations, most notably the date histogram aggregation, run perfectly well and very fast also on fields with very high cardinality. Is there a way to use the same technique for terms aggregation as well? in our case approximate counts will be ok too.
  5. The docs for the map execution hint state (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html): "Please note that Elasticsearch will ignore this execution hint if it is not applicable and that there is no backward compatibility guarantee on these hints." . Since the map execution hint is (currently) required for such queries to run, can the docs state those "not applicable" scenarios explicitly?
  6. Can global ordinals be monitored somehow? I couldn't find any obvious metric that would return the amount of memory it consumes, and in our case where significant amount of heap space is consumed, we just attribute it to global ordinals, true or not.

Background:

We have an index with a source_ip field. This field has a very high cardinality (about 90% of documents count). Tested with several versions of 6.x, latest 6.4.3.

When executing a terms aggregation on this field (mapped as keyword), initial query takes a lot of time to run also on a decent cluster with decent nodes, for a large corpus. A terms agg query on a 400m docs corpus ran for an hour before we killed it; on 70m docs it returned in 15-20 minutes. Subsequent queries ran fast, hence the attribution of latency to loading of global ordinals. Also, using the map execution hint helps.

This is how the search task looks like (got up to 55 minutes, then we killed the search task):

GET _cat/tasks?v

action                                task_id                       parent_task_id                type      start_time    timestamp running_time ip        node

indices:data/read/search              IJJVLCApT_KsA283KwS_oA:481102 -                             transport 1542704145263 08:55:45  4.7m         10.0.0.11 ip-10-0-0-11

indices:data/read/search[phase/query] RoXZonzjQs-IaFdaIRORww:9599   IJJVLCApT_KsA283KwS_oA:481102 netty     1542704145263 08:55:45  4.7m         10.0.0.33 ip-10-0-0-33
...
...

We observe about 4-5GB of unexplained heap usage - there is no evident monitoring for the space global ordinals occupy and as such we attributes this amount of memory to them.

The same happens when the field is defined as IP datatype, but seems to run even longer (for the 70m case, where it completes). Cardinality aggregation returns exactly the same cardinality value for both keyword and ip variants (on a small dataset with precision_threshold set to 40k), so doc_values for the IP field seems to be persisted correctly. Also, docvalue_fields returns the correct expected values.

Machines are i3 instances on AWS, so CPU, memory and especially disk IO are not a concern.

Looking forward for helpful advice,

Itamar


Why I see fielddata when doc_value is enabled in Aggregations?
(Jimferenczi) #2

Loading of global ordinals seem to sometimes run forever, with no circuit breakers triggered (see numbers below). Is that intentional?

The circuit breaker doesn't check the time that a query takes, only the amount of memory that is created.
Loading global ordinals can be expensive on large shards/indices with high cardinality field. The doc is maybe a bit optimistic and could mention workarounds if loading is slow or if the searcher is refreshed often (each refresh of the main searcher invalidates the loaded global ordinals).

  1. Seems like fields defined with IP datatype in the mapping take even longer to load (10-20% slower) - is there a reason for that? should we use multi-field (IP datatype for search, keyword for aggs)? Can we avoid multi-field?

The IP datatype encodes any ip in a keyword field so I guess that the decoding makes the loading of global ordinals slower. The ip field is represented as a number internally in 2.x, in 5.x we added the support for ipv6 and switched to a keyword representation. This means that the terms aggregation will use the global ordinals execution hint by default for this field. If the loading of global ordinals is too slow in your use case you can switch to the map execution which will be slower and requires more memory per query but doesn't require to load a shared resource on heap.

  1. Other aggregations, most notably the date histogram aggregation, run perfectly well and very fast also on fields with very high cardinality. Is there a way to use the same technique for terms aggregation as well? in our case approximate counts will be ok too.

Date histograms work with numbers directly so they don't build global ordinals at all. The terms aggregation should be fast when the global ordinals are built so there's nothing that we do better in date histograms, it's just that they work on different type of data. For keyword field we try to avoid loading a map that contains all unique terms for each query which is why the default mode is to build the global ordinals. You can opt out from this default by setting the execution_hint to map.

  1. The docs for the map execution hint state (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html): "Please note that Elasticsearch will ignore this execution hint if it is not applicable and that there is no backward compatibility guarantee on these hints." . Since the map execution hint is (currently) required for such queries to run, can the docs state those "not applicable" scenarios explicitly?

Currently it is ignored if you set the execution_hint to global_ordinals and global ordinals are not available for your source (script, field mapped as numbers, ...). map should always be honored.

  1. Can global ordinals be monitored somehow? I couldn't find any obvious metric that would return the amount of memory it consumes, and in our case where significant amount of heap space is consumed, we just attribute it to global ordinals, true or not.

They are monitored in the fielddata stats, we could have a dedicated section but if you don't have any text field that loads fielddata then the memory reported can be attributed to global ordinals entirely.


(Itamar Syn-Hershko) #3

Thanks so much for the prompt and very detailed answer!

Got a few follow-up questions if I may:

Can this be improved somehow? In some systems, a very high cardinality of a field is a given situation, and executing terms agg will essentially overload the cluster. Aside from fixing the doc, maybe the circuit breaker mechanism can be revisited to be hesitant of loading global ordinals for high cardinality fields, or check why it didn't trip in our scenario (we are estimating global ordinals took ~4GB)?

This is exactly what we figured, except we couldn't find anywhere rule-of-thumbs for memory usage difference between map / global_ordinals executions. Can you provide any?

Also, any plans for making global ordinals off-heap if at all possible?

I'm afraid that's not what we are seeing, at least not always. Currently using 6.5.1 and running the same tests we did before. We see the JVM heap grow significantly, but none of the Lucene Memory graphs grow accordingly. The fielddata stats remain very low, and loading time still spikes.

Here's a graph showing what we both expect to see - terms agg on 50m docs, on high cardinality field saved as keyword field (~80% of docs). Global ordinals are loaded relatively fast, and field data stats are shown:

The only worrying element in this graph is JVM heap shows a ~1.5GB growth but fielddata is only shown as 180MB, while no other significant workload occurs.

The first attempt, also shown in this graph, was done before - terms agg on an IP field with higher cardinality (unique term per document, almost). That aggregation took a while to run (5 minutes or so beginning ~14:20) and shows no growth in fielddata.

Then we went bigger. An index with 175m documents and a high cardinality keyword field (~80% of docs have unique value). Query execution took about 45 minutes, loaded about 10GB of data to the heap (+ one full GC) and no indication whatsoever in Lucene Memory stats:

This is the use-case I referred to, and to sum it all up:

  1. Stats are not showing correctly, sometimes not at all
  2. Global ordinals for 175m docs take exponentially more time to load than 50m (single shard). 230m docs didn't return also after one full hour.
  3. No circuit breakers triggered despite loading 10GB (maybe even more) during 45 minutes.

What do you think?


(Jimferenczi) #4

Yes sorry I was not clear, the memory used by global ordinals is tracked but not when they are built, only at the end when the build succeeded. This is due to the fact that the build is achieved by Lucene which doesn't have a concept of circuit breaker so we check the memory afterward.

  1. Global ordinals for 175m docs take exponentially more time to load than 50m (single shard). 230m docs didn't return also after one full hour.

What is the size of the fielddata at the end of the loading for the 175m docs ? In the attached graph it seems that the loading is not over so this would explain why the fielddata is shown at 0.

  1. No circuit breakers triggered despite loading 10GB (maybe even more) during 45 minutes.

It uses 10GB but we don't know the amount of memory that is really required. I need to do some tests to see the ratio between the memory needed to build the global ordinals and the final memory usage. Can you share a bit more of your testing deployment ? How many nodes and shards did you use for the 175M index ?


(Itamar Syn-Hershko) #5

The graphs shown were captured after the initial loading is finished. We waited for the query to return (for as long as it took), executed the query again to verify it returns instantaneously (with caches disabled) and then captured the screenshots. Not seeing the fielddata graph spike was exactly my wtf moment as well :slight_smile:

Like I wrote previously, the JVM heap memory usage is the only indication we have for how much memory that process took. I believe we are talking GBs of data, and that is exactly why I was surprised there was no circuit breakers involved.

Hopefully this is all clear now.

Maybe that's the culprit then?

I wrote the machine specs etc above in detail. Single shard on a 3 master 3 data cluster, no replicas, so we had a quiet server to test on. Which more info do you need?


(Jimferenczi) #6

50 minutes to load the global ordinals of (175/3)M documents seem very high, 15-20 minutes to load 70M too.
I tested with a keyword field on 30M docs and the global ordinals are loaded in 30s on my laptop. Can you share some hot threads of the nodes when the global ordinals are loading ? Can you also share the query that you use for your tests and some sample data ?


(Itamar Syn-Hershko) #7

I agree. What we saw is exponential growth in the time it takes to load global ordinals for high cardinality fields, so 30M docs with a field with 30M unique values might take 30s but 70M will take much longer.

The query was a simple terms aggregation on the high cardinality field, with match_all or any other type of query. We can send hot_threads next week.

We can share our full sample data we use for testing (on S3), we can send details privately.