Circuit breaker exception after upgrading from ES 7.16.2 to ES 8.11.4

Hello,

After upgrading from ES 7.16.2 to ES 8.11.4, search queries that ran fine on ES7 fail on ES8. This is unexpected, especially since the new cluster has more resources :thinking:

The setup of data nodes:

  • ES7 -> 4 data nodes, each ~4GB memory for ES and ~4GB for OS (Ubuntu 20)
  • ES8 -> 4 data nodes, each ~8GB memory for ES and ~8GB for OS (Ubuntu 22)

The queries are heavy, but they work on ES7, which makes me think the problem is specific to ES8. I compared the circuit breaker settings on both clusters and found no difference.

The actual error is:

"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"product_34_126_t","node":"tf_L3iH8T8igkOzQMSBKDw","reason":{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<reused_arrays>] would be [7964077776/7.4gb], which is larger than the limit of [7961208422/7.4gb], real usage: [7958064848/7.4gb], new bytes reserved: [6012928/5.7mb], usages [eql_sequence=0/0b, fielddata=18659348/17.7mb, request=9007952/8.5mb, inflight_requests=5090/4.9kb, model_inference=0/0b]","bytes_wanted":7964077776,"bytes_limit":7961208422,"durability":"PERMANENT"}}]
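The numbers in this error can be cross-checked with a little arithmetic. The sketch below is illustrative only: it copies the values from the message above and takes the heap size from the `heap_max_in_bytes` value in the node settings posted later in this thread, assuming the heap size was the same when the error was raised. The variable names are made up for the example.

```python
# Values copied from the circuit_breaking_exception above.
real_usage = 7_958_064_848      # "real usage"
new_bytes = 6_012_928           # "new bytes reserved"
bytes_wanted = 7_964_077_776    # "bytes_wanted"
bytes_limit = 7_961_208_422     # "bytes_limit"

# Assumption: heap_max_in_bytes from the node settings later in the thread
# also applied at the time of the error.
heap_max = 8_380_219_392

# Child breaker usages listed in the message.
child_usages = {
    "eql_sequence": 0,
    "fielddata": 18_659_348,
    "request": 9_007_952,
    "inflight_requests": 5_090,
    "model_inference": 0,
}

tracked = sum(child_usages.values())       # bytes accounted for by child breakers
limit_fraction = bytes_limit / heap_max    # parent limit relative to the heap
over_limit = bytes_wanted - bytes_limit    # how far the request overshoots

print("real usage + new bytes == bytes wanted:",
      real_usage + new_bytes == bytes_wanted)
print(f"limit as a fraction of heap: {limit_fraction:.4f}")
print(f"tracked by child breakers:   {tracked / 2**20:.1f} MiB")
print(f"over the limit by:           {over_limit / 2**20:.1f} MiB")
```

Under these assumptions, the child breakers account for only a few tens of MiB, while the "real usage" figure is already close to the parent limit before the request reserves anything. That suggests the heap was nearly full for reasons the child breakers do not track.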

Here are some logs from that node (there are many of them, all of the same type):

[2024-05-08T15:58:49,871][INFO ][o.e.m.j.JvmGcMonitorService] [eu-test-dataHorse-1] [gc][2619] overhead, spent [268ms] collecting in the last [1s]
[2024-05-08T15:58:52,686][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [eu-test-dataHorse-1] attempting to trigger G1GC due to high heap usage [8297270888]
[2024-05-08T15:58:52,702][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [eu-test-dataHorse-1] memory usage down after [0], before [8297270888], after [8261635704]
[2024-05-08T15:58:52,702][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [eu-test-dataHorse-1] GC did bring memory usage down, before [8297270888], after [8261635704], allocations [17], duration [17]
[2024-05-08T15:58:52,710][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [eu-test-dataHorse-1] memory usage not down after [8], before [8299384440], after [8299384440]
[2024-05-08T15:58:52,710][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [eu-test-dataHorse-1] memory usage not down after [8], before [8299384440], after [8299384440]
[2024-05-08T15:58:52,716][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [eu-test-dataHorse-1] memory usage not down after [14], before [8299384440], after [8299384440]
[2024-05-08T15:58:52,717][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [eu-test-dataHorse-1] memory usage not down after [15], before [8299384440], after [8299384440]
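To see how little the forced GC reclaims, the before/after values can be pulled out of these log lines. This is a rough sketch, not a general log parser: the regular expression is written only against the line format shown above, `gc_effect` is a hypothetical helper name, and the heap size is taken from `heap_max_in_bytes` in the node settings posted later in the thread.

```python
import re

# Assumption: heap_max_in_bytes from the node settings in this thread also
# applied when these log lines were written.
HEAP_MAX = 8_380_219_392

# Two lines copied from the log above.
log_lines = [
    "[2024-05-08T15:58:52,702][INFO ][o.e.i.b.HierarchyCircuitBreakerService] "
    "[eu-test-dataHorse-1] GC did bring memory usage down, before [8297270888], "
    "after [8261635704], allocations [17], duration [17]",
    "[2024-05-08T15:58:52,710][INFO ][o.e.i.b.HierarchyCircuitBreakerService] "
    "[eu-test-dataHorse-1] memory usage not down after [8], before [8299384440], "
    "after [8299384440]",
]

# Matches the "before [N], after [N]" pair present in both message variants.
BEFORE_AFTER = re.compile(r"before \[(\d+)\], after \[(\d+)\]")


def gc_effect(line):
    """Return (before, after, freed) in bytes, or None if the line has no pair."""
    match = BEFORE_AFTER.search(line)
    if match is None:
        return None
    before, after = int(match.group(1)), int(match.group(2))
    return before, after, before - after


for line in log_lines:
    effect = gc_effect(line)
    if effect is None:
        continue
    before, after, freed = effect
    print(f"freed {freed / 2**20:.1f} MiB, "
          f"heap after GC at {after / HEAP_MAX:.1%} of max")
```

Running it over the full log rather than two sample lines would show whether any of the triggered collections brings usage meaningfully below the breaker limit.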

I appreciate any ideas :slight_smile:

There seem to be some changes in JVM memory usage/management between the two versions, possibly significant ones.

I ran a few identical searches on both clusters, and here are the charts. Same query, almost the same index size.

Old - ES7.16

New - ES8.11

I'm not sure how you upgraded from 7.16 to 8, as this upgrade path is not supported and shouldn't have worked. Did you upgrade to 7.17 first?

Are you using the bundled Java or a different one? There was an issue in 8.11 related to Java 21, but I think it was already fixed in 8.11.4.

Also, if this is a bug, you need to upgrade to the latest version and replicate it there; 8.11 will not receive any more fixes.

Can you upgrade to 8.13 and see if this is still happening?

Thanks for your support and time!

I created a new ES 8.11.4 cluster and restored snapshots from 7.16.2.
(I have this luxury :slight_smile: )

Yes, I'm using the bundled Java.

Unfortunately, I'm bound to one of the custom plugins for ES. Its latest version supports only ES 8.11.4 :frowning_face:

Currently, I'm reindexing the indices restored from the ES 7.16.2 snapshot.
I will see if this helps; maybe it's a Lucene-related issue.

The next step would be to upgrade to the latest ES version to check if it resolves the issue.

Hello,

An update: on ES 8.11.4 I downgraded the JDK to 17.0.7 (the same version I have on the old ES7 cluster) and instructed ES to use this unbundled JDK.
It didn't change how the node manages memory, which makes me think the issue is not only JDK-related and that something changed in ES8 itself.

The issue occurs when a query returns many products in combination with heavy aggregations (such as cardinality).

My node settings:

"tf_L3iH8T8igkOzQMSBKDw": {
      "name": "node-1",
      "transport_address": "10.8.10.35:9300",
      "host": "10.8.10.35",
      "ip": "10.8.10.35",
      "version": "8.11.4",
      "transport_version": 8512001,
      "index_version": 8500003,
      "component_versions": {
        "transform_config_version": 10000099,
        "ml_config_version": 11000099
      },
      "build_flavor": "default",
      "build_type": "deb",
      "build_hash": "da06c53fd49b7e676ccf8a32d6655c5155c16d81",
      "roles": [
        "data",
        "remote_cluster_client"
      ],
      "attributes": {
        "ml.config_version": "11.0.0",
        "type": "ReadOptimized",
        "xpack.installed": "true",
        "transform.config_version": "10.0.0"
      },
      "jvm": {
        "pid": 550308,
        "version": "17.0.7",
        "vm_name": "OpenJDK 64-Bit Server VM",
        "vm_version": "17.0.7+7",
        "vm_vendor": "Eclipse Adoptium",
        "using_bundled_jdk": false,
        "start_time": "2024-05-10T10:32:50.171Z",
        "start_time_in_millis": 1715337170171,
        "mem": {
          "heap_init": "7.8gb",
          "heap_init_in_bytes": 8380219392,
          "heap_max": "7.8gb",
          "heap_max_in_bytes": 8380219392,
          "non_heap_init": "7.3mb",
          "non_heap_init_in_bytes": 7667712,
          "non_heap_max": "0b",
          "non_heap_max_in_bytes": 0,
          "direct_max": "0b",
          "direct_max_in_bytes": 0
        },
        "gc_collectors": [
          "G1 Young Generation",
          "G1 Old Generation"
        ],
        "memory_pools": [
          "CodeHeap 'non-nmethods'",
          "Metaspace",
          "CodeHeap 'profiled nmethods'",
          "Compressed Class Space",
          "G1 Eden Space",
          "G1 Old Gen",
          "G1 Survivor Space",
          "CodeHeap 'non-profiled nmethods'"
        ],
        "using_compressed_ordinary_object_pointers": "true",
        "input_arguments": [
          "-Des.networkaddress.cache.ttl=60",
          "-Des.networkaddress.cache.negative.ttl=10",
          "-Djava.security.manager=allow",
          "-XX:+AlwaysPreTouch",
          "-Xss1m",
          "-Djava.awt.headless=true",
          "-Dfile.encoding=UTF-8",
          "-Djna.nosys=true",
          "-XX:-OmitStackTraceInFastThrow",
          "-Dio.netty.noUnsafe=true",
          "-Dio.netty.noKeySetOptimization=true",
          "-Dio.netty.recycler.maxCapacityPerThread=0",
          "-Dlog4j.shutdownHookEnabled=false",
          "-Dlog4j2.disable.jmx=true",
          "-Dlog4j2.formatMsgNoLookups=true",
          "-Djava.locale.providers=SPI,COMPAT",
          "--add-opens=java.base/java.io=org.elasticsearch.preallocate",
          "-XX:+UseG1GC",
          "-Djava.io.tmpdir=/tmp/elasticsearch-17834061241544733114",
          "-XX:+HeapDumpOnOutOfMemoryError",
          "-XX:+ExitOnOutOfMemoryError",
          "-XX:HeapDumpPath=/var/lib/elasticsearch",
          "-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log",
          "-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,level,pid,tags:filecount=32,filesize=64m",
          "-Xms7990m",
          "-Xmx7990m",
          "-XX:+UnlockDiagnosticVMOptions",
          "-XX:+G1UsePreventiveGC",
          "-XX:MaxDirectMemorySize=4190109696",
          "-XX:G1HeapRegionSize=4m",
          "-XX:InitiatingHeapOccupancyPercent=30",
          "-XX:G1ReservePercent=15",
          "-Des.distribution.type=deb",
          "--module-path=/usr/share/elasticsearch/lib",
          "--add-modules=jdk.net",
          "--add-modules=ALL-MODULE-PATH",
          "-Djdk.module.main=org.elasticsearch.server"
        ]
      }
    },
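For comparing the JVM settings of the old and new clusters, the `input_arguments` list can be reduced to a small dictionary of memory and GC flags and then diffed. The sketch below is only one way to do that: `parse_jvm_flags` and `diff_flags` are hypothetical helper names, and the sample list is a subset of the arguments shown above. The old cluster's arguments are not posted in this thread, so none are included.

```python
def parse_jvm_flags(args):
    """Map -XX and -Xms/-Xmx/-Xss arguments to values.

    Boolean switches (-XX:+Name / -XX:-Name) become True / False;
    everything else is kept as a string.
    """
    flags = {}
    for arg in args:
        if arg.startswith("-XX:"):
            body = arg[4:]
            if body[:1] in ("+", "-"):
                flags[body[1:]] = body[0] == "+"
            else:
                name, _, value = body.partition("=")
                flags[name] = value
        elif arg.startswith(("-Xms", "-Xmx", "-Xss")):
            flags[arg[1:4]] = arg[4:]
    return flags


def diff_flags(old, new):
    """Return {flag: (old_value, new_value)} for flags that differ."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


# Subset of the input_arguments from the ES 8.11.4 node above.
es8_args = [
    "-XX:+UseG1GC",
    "-Xms7990m",
    "-Xmx7990m",
    "-XX:+G1UsePreventiveGC",
    "-XX:MaxDirectMemorySize=4190109696",
    "-XX:G1HeapRegionSize=4m",
    "-XX:InitiatingHeapOccupancyPercent=30",
    "-XX:G1ReservePercent=15",
]

es8_flags = parse_jvm_flags(es8_args)
for name, value in sorted(es8_flags.items()):
    print(f"{name} = {value}")
```

Passing the old cluster's `input_arguments` (for example from its `GET _nodes/jvm` output) through `parse_jvm_flags` and then calling `diff_flags(es7_flags, es8_flags)` would list only the flags that differ between the two nodes.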