Elasticsearch 7.17.3: [parent] Data too large, data for [cluster:monitor/nodes/stats[n]]

Hello,

We recently upgraded from ES 7.9.2 to 7.17.3 (in preparation for the subsequent 8.x upgrade) and noticed that the cluster spends a lot more time in Yellow state than it used to. The root cause appears to be "circuit_breaking_exception" errors reported by the _nodes/stats/breaker API; an example is included below.

I understand the purpose of circuit breakers, but I'd like to understand what they mean in the [cluster:monitor/nodes/stats[n]] context. Are such exceptions always caused by data nodes, or can they be triggered on master or coordinator nodes as well?

All our data nodes are running with 30GB java heap space on instances with 64+ GB RAM. Master nodes have 10GB heap space with 32GB total RAM. Coordinator nodes have 24GB heap size on instances with 32GB RAM.

These are the JVM options we are using: "-XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true"

What would you recommend as remediation steps here?

{
        "type" : "failed_node_exception",
        "reason" : "Failed node [Pm8J_qCaRHuMycKV66e3DA]",
        "node_id" : "Pm8J_qCaRHuMycKV66e3DA",
        "caused_by" : {
          "type" : "circuit_breaking_exception",
          "reason" : "[parent] Data too large, data for [cluster:monitor/nodes/stats[n]] would be [30822705860/28.7gb], which is larger than the limit of [30601641984/28.5gb], real usage: [30822551600/28.7gb], new bytes reserved: [154260/150.6kb], usages [request=0/0b, fielddata=2143218578/1.9gb, in_flight_requests=154260/150.6kb, model_inference=0/0b, eql_sequence=0/0b, accounting=1212327420/1.1gb]",
          "bytes_wanted" : 30822705860,
          "bytes_limit" : 30601641984,
          "durability" : "PERMANENT"
        }
      }
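
For anyone reading along, the numbers in that reason string can be broken down with a short script. This is only a sketch: it assumes the parent breaker limit is the default 95% of heap (the default for indices.breaker.total.limit when the real-memory breaker is enabled), and the function and variable names are made up for illustration.

```python
import re

GIB = 1024 ** 3

# Reason string copied from the exception above.
REASON = (
    "[parent] Data too large, data for [cluster:monitor/nodes/stats[n]] "
    "would be [30822705860/28.7gb], which is larger than the limit of "
    "[30601641984/28.5gb], real usage: [30822551600/28.7gb], "
    "new bytes reserved: [154260/150.6kb], usages [request=0/0b, "
    "fielddata=2143218578/1.9gb, in_flight_requests=154260/150.6kb, "
    "model_inference=0/0b, eql_sequence=0/0b, accounting=1212327420/1.1gb]"
)


def parse_breaker_reason(reason):
    """Extract the byte counts from a parent breaker 'Data too large' message."""
    limit = int(re.search(r"limit of \[(\d+)/", reason).group(1))
    real = int(re.search(r"real usage: \[(\d+)/", reason).group(1))
    reserved = int(re.search(r"new bytes reserved: \[(\d+)/", reason).group(1))
    usages_text = re.search(r"usages \[(.*)\]", reason).group(1)
    usages = {name: int(value)
              for name, value in re.findall(r"(\w+)=(\d+)/", usages_text)}
    return {"limit": limit, "real_usage": real,
            "reserved": reserved, "usages": usages}


parsed = parse_breaker_reason(REASON)

# Assumption: limit = 95% of max heap, so the heap size can be inferred.
implied_heap_gib = parsed["limit"] * 100 / 95 / GIB

# Memory accounted for by the child breakers versus measured heap usage.
tracked = sum(parsed["usages"].values())
tracked_share = tracked / parsed["real_usage"]

print(f"implied heap size: {implied_heap_gib:.1f} GiB")
print(f"real usage + new reservation: {parsed['real_usage'] + parsed['reserved']}")
print(f"tracked by child breakers: {tracked / GIB:.2f} GiB ({tracked_share:.1%} of real usage)")
```

Under that assumption the limit corresponds to a 30 GiB heap, which matches the data node heap size rather than the 10GB master or 24GB coordinator heaps. The child breakers account for only a small share of the measured heap usage, so the request named in the message is most likely just the one that happened to arrive when the heap was already nearly full.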

Thanks!

What is the output from the _cluster/stats?pretty&human API?

Hi Mark, as I was looking through the response from that API, I noticed the following:

"versions" : [
      {
        "version" : "7.9.2",
        "index_count" : 34,
        "primary_shard_count" : 34,
        "total_primary_size" : "28.3gb",
        "total_primary_bytes" : 30458649323
      },
      {
        "version" : "7.17.3",
        "index_count" : 109,
        "primary_shard_count" : 10205,
        "total_primary_size" : "192.5tb",
        "total_primary_bytes" : 211691506218250
      }
    ]

It seems we have shards with different ES versions. Is this something we should be concerned about?

Is there anything in particular you'd like from that API response, Mark?

Thanks!

That's not a major concern as the version will upgrade as soon as a merge happens.

What causes circuit breaker errors for [cluster:monitor/nodes/stats[n]]?
Any suggestions on how to troubleshoot this?

Thanks!

What is the output from the _cluster/stats?pretty&human API?

Hi Mark, please see the output below.

{
  "_nodes" : {
    "total" : 195,
    "successful" : 195,
    "failed" : 0
  },
  "cluster_name" : "big-cluster",
  "cluster_uuid" : "K-1234aabdqE-12345",
  "timestamp" : 1666659916602,
  "status" : "yellow",
  "indices" : {
    "count" : 143,
    "shards" : {
      "total" : 20274,
      "primaries" : 10239,
      "replication" : 0.9800761793143862,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 1440,
          "avg" : 141.7762237762238
        },
        "primaries" : {
          "min" : 1,
          "max" : 720,
          "avg" : 71.6013986013986
        },
        "replication" : {
          "min" : 0.6958333333333333,
          "max" : 2.0,
          "avg" : 1.158634421134421
        }
      }
    },
    "docs" : {
      "count" : 790445613008,
      "deleted" : 457195189
    },
    "store" : {
      "size" : "382.9tb",
      "size_in_bytes" : 421101292372832,
      "total_data_set_size" : "382.9tb",
      "total_data_set_size_in_bytes" : 421101292372832,
      "reserved" : "0b",
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size" : "141.1gb",
      "memory_size_in_bytes" : 151560137480,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "117.6gb",
      "memory_size_in_bytes" : 126286295378,
      "total_count" : 1073032027,
      "hit_count" : 50921816,
      "miss_count" : 1022110211,
      "cache_size" : 462981,
      "cache_count" : 2776301,
      "evictions" : 2313320
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 610297,
      "memory" : "90gb",
      "memory_in_bytes" : 96707140126,
      "terms_memory" : "72.4gb",
      "terms_memory_in_bytes" : 77805275064,
      "stored_fields_memory" : "372.6mb",
      "stored_fields_memory_in_bytes" : 390784984,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "210.6kb",
      "norms_memory_in_bytes" : 215680,
      "points_memory" : "0b",
      "points_memory_in_bytes" : 0,
      "doc_values_memory" : "17.2gb",
      "doc_values_memory_in_bytes" : 18510864398,
      "index_writer_memory" : "22.5gb",
      "index_writer_memory_in_bytes" : 24246476072,
      "version_map_memory" : "183.6mb",
      "version_map_memory_in_bytes" : 192603869,
      "fixed_bit_set" : "666.9gb",
      "fixed_bit_set_memory_in_bytes" : 716148612688,
      "max_unsafe_auto_id_timestamp" : 1666656045836,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "boolean",
          "count" : 731,
          "index_count" : 61,
          "script_count" : 0
        },
        {
          "name" : "constant_keyword",
          "count" : 6,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "date",
          "count" : 1563,
          "index_count" : 128,
          "script_count" : 0
        },
        {
          "name" : "float",
          "count" : 386,
          "index_count" : 34,
          "script_count" : 0
        },
        {
          "name" : "geo_point",
          "count" : 30,
          "index_count" : 30,
          "script_count" : 0
        },
        {
          "name" : "half_float",
          "count" : 120,
          "index_count" : 30,
          "script_count" : 0
        },
        {
          "name" : "integer",
          "count" : 365,
          "index_count" : 36,
          "script_count" : 0
        },
        {
          "name" : "ip",
          "count" : 104,
          "index_count" : 46,
          "script_count" : 0
        },
        {
          "name" : "join",
          "count" : 14,
          "index_count" : 14,
          "script_count" : 0
        },
        {
          "name" : "keyword",
          "count" : 29100,
          "index_count" : 127,
          "script_count" : 0
        },
        {
          "name" : "long",
          "count" : 4962,
          "index_count" : 123,
          "script_count" : 0
        },
        {
          "name" : "nested",
          "count" : 512,
          "index_count" : 74,
          "script_count" : 0
        },
        {
          "name" : "object",
          "count" : 4305,
          "index_count" : 102,
          "script_count" : 0
        },
        {
          "name" : "short",
          "count" : 15,
          "index_count" : 15,
          "script_count" : 0
        },
        {
          "name" : "text",
          "count" : 695,
          "index_count" : 45,
          "script_count" : 0
        },
        {
          "name" : "version",
          "count" : 4,
          "index_count" : 4,
          "script_count" : 0
        }
      ],
      "runtime_field_types" : [ ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [ ],
      "analyzer_types" : [
        {
          "name" : "custom",
          "count" : 44,
          "index_count" : 44
        }
      ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [
        {
          "name" : "keyword",
          "count" : 44,
          "index_count" : 44
        }
      ],
      "built_in_filters" : [
        {
          "name" : "lowercase",
          "count" : 44,
          "index_count" : 44
        }
      ],
      "built_in_analyzers" : [ ]
    },
    "versions" : [
      {
        "version" : "7.9.2",
        "index_count" : 34,
        "primary_shard_count" : 34,
        "total_primary_size" : "28.3gb",
        "total_primary_bytes" : 30458649323
      },
      {
        "version" : "7.17.3",
        "index_count" : 109,
        "primary_shard_count" : 10205,
        "total_primary_size" : "192.5tb",
        "total_primary_bytes" : 211691506218250
      }
    ]
  },
  "nodes" : {
    "count" : {
      "total" : 195,
      "coordinating_only" : 0,
      "data" : 178,
      "data_cold" : 178,
      "data_content" : 178,
      "data_frozen" : 178,
      "data_hot" : 178,
      "data_warm" : 178,
      "ingest" : 192,
      "master" : 3,
      "ml" : 195,
      "remote_cluster_client" : 195,
      "transform" : 178,
      "voting_only" : 0
    },
    "versions" : [
      "7.17.3"
    ],
    "os" : {
      "available_processors" : 2416,
      "allocated_processors" : 2416,
      "names" : [
        {
          "name" : "Linux",
          "count" : 195
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "CentOS Linux 7 (Core)",
          "count" : 195
        }
      ],
      "architectures" : [
        {
          "arch" : "amd64",
          "count" : 195
        }
      ],
      "mem" : {
        "total" : "17.3tb",
        "total_in_bytes" : 19052991717376,
        "free" : "229.4gb",
        "free_in_bytes" : 246327709696,
        "used" : "17.1tb",
        "used_in_bytes" : 18806664007680,
        "free_percent" : 1,
        "used_percent" : 99
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 1835
      },
      "open_file_descriptors" : {
        "min" : 4561,
        "max" : 7841,
        "avg" : 6420
      }
    },
    "jvm" : {
      "max_uptime" : "131.3d",
      "max_uptime_in_millis" : 11345820318,
      "versions" : [
        {
          "version" : "17.0.3",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "17.0.3+7-LTS",
          "vm_vendor" : "Azul Systems, Inc.",
          "bundled_jdk" : true,
          "using_bundled_jdk" : false,
          "count" : 195
        }
      ],
      "mem" : {
        "heap_used" : "3.5tb",
        "heap_used_in_bytes" : 3941226079168,
        "heap_max" : "5.5tb",
        "heap_max_in_bytes" : 6126770847744
      },
      "threads" : 29723
    },
    "fs" : {
      "total" : "1pb",
      "total_in_bytes" : 1237832941223936,
      "free" : "737.2tb",
      "free_in_bytes" : 810597725245440,
      "available" : "680.3tb",
      "available_in_bytes" : 748049165389824
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 195
      },
      "http_types" : {
        "security4" : 195
      }
    },
    "discovery_types" : {
      "zen" : 195
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "rpm",
        "count" : 195
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 14,
      "processor_stats" : {
        "conditional" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "geoip" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "grok" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "remove" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "rename" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "set" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        }
      }
    }
  }
}
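
To put the cluster-wide totals above into per-node terms, here is a rough back-of-the-envelope sketch. It simply divides selected totals by the number of data nodes, so it assumes an even spread across nodes, which a real cluster will not have. The constant and function names are illustrative only.

```python
GIB = 1024 ** 3

# Values copied from the _cluster/stats output above.
DATA_NODES = 178
TOTAL_SHARDS = 20274
HEAP_USED = 3941226079168
HEAP_MAX = 6126770847744
STATS_BYTES = {
    "fielddata": 151560137480,
    "query_cache": 126286295378,
    "segments": 96707140126,
    "index_writer": 24246476072,
    "fixed_bit_set": 716148612688,
}


def per_data_node_gib(total_bytes, data_nodes=DATA_NODES):
    """Average GiB per data node, assuming an even distribution."""
    return total_bytes / data_nodes / GIB


shards_per_data_node = TOTAL_SHARDS / DATA_NODES
heap_used_percent = 100 * HEAP_USED / HEAP_MAX

for name, total in STATS_BYTES.items():
    print(f"{name:>14}: {per_data_node_gib(total):5.2f} GiB per data node")
print(f"shards per data node: {shards_per_data_node:.1f}")
print(f"cluster-wide heap used: {heap_used_percent:.1f}%")
```

The cluster-wide heap figure is an average over all 195 nodes, so it can look comfortable even while individual data nodes sit right at the breaker limit, as the node in the exception did.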

Thanks!

That's a pretty huge cluster, so it's not surprising you are seeing this, as the data that Monitoring collects covers all of your nodes.

Is there a reason your cluster is this large? Generally we encourage smaller clusters as they are easier to manage, and then use CCS to be able to gain a single view into all the data.

Splitting this large cluster into smaller ones is not an option for us. We can, however, increase the sizes of master nodes if this is necessary for managing larger cluster state.

I think that we have a problem with memory pressure on the data nodes that prevents them from properly releasing memory. This leaves very little memory available for operations such as collecting monitoring data, so circuit breaker thresholds are breached and exceptions are thrown. Our workaround has been to restart data nodes, which lets them run without issues for a week or so until we see circuit breaker exceptions again. We've also noticed that clearing caches can temporarily remediate these issues.

Aside from splitting the cluster, what should we be looking into next? What APIs would you recommend for troubleshooting memory issues?
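
One starting point is the node stats API restricted to the jvm and breaker metrics (GET _nodes/stats/jvm,breaker), which reports heap usage and the parent breaker's estimate, limit, and trip count for each node. Below is a minimal sketch for ranking nodes from a saved response. The sample data and node names are hypothetical, and only a few fields of the response are shown.

```python
def rank_nodes_by_heap(nodes_stats, top=5):
    """Return the nodes with the highest heap usage from a nodes stats response."""
    rows = []
    for node_id, node in nodes_stats["nodes"].items():
        parent = node["breakers"]["parent"]
        rows.append({
            "name": node.get("name", node_id),
            "heap_used_percent": node["jvm"]["mem"]["heap_used_percent"],
            "parent_tripped": parent["tripped"],
            "parent_ratio": parent["estimated_size_in_bytes"]
                            / parent["limit_size_in_bytes"],
        })
    rows.sort(key=lambda r: (r["heap_used_percent"], r["parent_tripped"]),
              reverse=True)
    return rows[:top]


# Hypothetical, heavily trimmed sample of a GET _nodes/stats/jvm,breaker response.
SAMPLE = {
    "nodes": {
        "id-a": {"name": "data-node-a",
                 "jvm": {"mem": {"heap_used_percent": 96}},
                 "breakers": {"parent": {"limit_size_in_bytes": 30601641984,
                                         "estimated_size_in_bytes": 30822551600,
                                         "tripped": 412}}},
        "id-b": {"name": "data-node-b",
                 "jvm": {"mem": {"heap_used_percent": 61}},
                 "breakers": {"parent": {"limit_size_in_bytes": 30601641984,
                                         "estimated_size_in_bytes": 19500000000,
                                         "tripped": 0}}},
        "id-c": {"name": "data-node-c",
                 "jvm": {"mem": {"heap_used_percent": 88}},
                 "breakers": {"parent": {"limit_size_in_bytes": 30601641984,
                                         "estimated_size_in_bytes": 28100000000,
                                         "tripped": 3}}},
    }
}

for row in rank_nodes_by_heap(SAMPLE, top=2):
    print(f"{row['name']}: heap {row['heap_used_percent']}%, "
          f"parent breaker at {row['parent_ratio']:.0%} of limit, "
          f"tripped {row['parent_tripped']} times")
```

Running this periodically and comparing the top entries over a few days should show whether the same nodes creep upward after each restart or whether the pressure moves around with the workload.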

Thanks!

Splitting the cluster :P It's very large, and while you can increase the data nodes to reduce pressure, you're still going to hit the same issues of a cluster that large.

We are also seeing similar circuit breaker exceptions from operations other than monitoring stats collectors. This tells me that even though the large size of the cluster does put more memory pressure on the monitoring operation, it is not really the root cause of our problem. Moreover, we've been running a cluster of this size for over two years and had not seen circuit breaker exceptions until recently. Around that time we made the following changes to the cluster:

  1. Upgraded from 7.9.2 to 7.17.3
  2. Switched the JVM to zulu-17
  3. Added a new type of data which could lead to large fielddata and query caches

Here is an example of a different circuit breaker exception: [internal:index/shard/recovery/start_recovery]]; nested: CircuitBreakingException[[parent] Data too large, data for [internal:index/shard/recovery/start_recovery] would be [31165960446/29gb], which is larger than the limit of [30601641984/28.5gb], real usage: [31165958624/29gb], new bytes reserved: [1822/1.7kb], usages [request=0/0b, fielddata=1898653268/1.7gb, in_flight_requests=5468/5.3kb, model_inference=0/0b, eql_sequence=0/0b, accounting=1119165988/1gb]

Are there any known memory issues in ES 7.17.3 and/or zulu-17? What's the best way to troubleshoot memory usage of a particular field in an index?
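
On the per-field question, the cat fielddata API (GET _cat/fielddata?format=json&bytes=b) lists fielddata size per node and field, and GET _nodes/stats/indices/fielddata?fields=* gives a similar breakdown in the node stats. A small sketch for totalling a saved _cat/fielddata response by field follows. The rows and field names in the sample are hypothetical.

```python
from collections import defaultdict


def fielddata_by_field(rows):
    """Sum fielddata bytes per field across all nodes, largest first."""
    totals = defaultdict(int)
    for row in rows:
        # With bytes=b the size column is a plain byte count (as a string).
        totals[row["field"]] += int(row["size"])
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical rows in the shape returned by
# GET _cat/fielddata?format=json&bytes=b
SAMPLE_ROWS = [
    {"id": "id-a", "host": "10.0.0.1", "ip": "10.0.0.1",
     "node": "data-node-a", "field": "user.id", "size": "1500"},
    {"id": "id-a", "host": "10.0.0.1", "ip": "10.0.0.1",
     "node": "data-node-a", "field": "session.tags", "size": "500"},
    {"id": "id-b", "host": "10.0.0.2", "ip": "10.0.0.2",
     "node": "data-node-b", "field": "user.id", "size": "2500"},
]

for field, size in fielddata_by_field(SAMPLE_ROWS):
    print(f"{field}: {size} bytes")
```

If one or two fields from the newly added data dominate the totals, that would point at the third change in the list above rather than the version or JVM upgrade.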

Thanks!

According to the official support matrix it does not look like zulu-17 is a supported JVM. Might it be worth trying with an officially supported one to see if this is an issue?

According to that matrix, Elasticsearch 7.17.x supports Oracle/OpenJDK**/Temurin 17. According to the ** footnote quoted below, this also covers Azul Zulu 17, which is the version we are using.

** Elastic supports some OpenJDK-derived distributions: 1. builds by the IcedTea Project; 2. those produced by OS vendors in the “Product and Operating System” matrix which have passed the TCK tests; 3. Azul Zulu starting with Elasticsearch 6.6.0.

