[circuit_breaking_exception] [parent] Data too large

Hi all,

Since this morning I have been getting the following errors from time to time in my ECK Elasticsearch 7.9.0 cluster.

Version: 7.9.0
Build: 33813
Error
    at Fetch._callee3$ (https://kibana.problem.cluster/33813/bundles/core/core.entry.js:34:109213)
    at l (https://kibana.problem.cluster/33813/bundles/kbn-ui-shared-deps/kbn-ui-shared-deps.js:368:155323)
    at Generator._invoke (https://kibana.problem.cluster/33813/bundles/kbn-ui-shared-deps/kbn-ui-shared-deps.js:368:155076)
    at Generator.forEach.e.<computed> [as next] (https://kibana.problem.cluster/33813/bundles/kbn-ui-shared-deps/kbn-ui-shared-deps.js:368:155680)
    at fetch_asyncGeneratorStep (https://kibana.problem.cluster/33813/bundles/core/core.entry.js:34:102354)
    at _next (https://kibana.problem.cluster/33813/bundles/core/core.entry.js:34:102670)
[circuit_breaking_exception] [parent] Data too large, data for [<http_request>] would be [3173199800/2.9gb], which is larger than the limit of [3060164198/2.8gb], real usage: [3173199800/2.9gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=0/0b, in_flight_requests=56042/54.7kb, model_inference=0/0b, accounting=33495964/31.9mb], with { bytes_wanted=3173199800 & bytes_limit=3060164198 & durability="PERMANENT" }

Any idea what could be going wrong or how I can fix that?
Thanks in advance

It seems like the data pods are running out of heap, which is strange, as the cluster is not really under high load at the moment; we are still at the beginning of the migration.

For now I have run a kubectl rollout restart on elastic-es-data to restart all data pods. The pods appear to be using less heap again after the restart. However, it would still be nice to know the reason.
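
For reference, roughly what I ran (ECK puts the data nodes in a StatefulSet; the exact name and namespace depend on the cluster and nodeSet names, so these are from my setup):

    # Restart all data pods by rolling the StatefulSet that ECK created for the data nodeSet
    kubectl rollout restart statefulset/elastic-es-data -n elastic
    # Follow the rolling restart until it finishes
    kubectl rollout status statefulset/elastic-es-data -n elastic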

You should look at why heap usage is so high on these nodes.
Do you have Monitoring enabled?
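
If you have API access, a quick way to see where the heap is going is something like this (host and credentials are placeholders):

    # Per-node heap usage versus the configured heap size
    curl -sk -u elastic:$ELASTIC_PASSWORD "https://localhost:9200/_cat/nodes?v=true&h=name,node.role,heap.percent,heap.max"
    # Current state of the circuit breakers (parent, fielddata, request, ...)
    curl -sk -u elastic:$ELASTIC_PASSWORD "https://localhost:9200/_nodes/stats/breaker?pretty"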

I am collecting the logs for the Elasticsearch data and master pods as well as the Filebeat pods.
Are there any specific keywords I should look for in the logs?

Here are a few entries I have grepped out of the logs.
I hope they can give some indication of the root cause of this issue.
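
Roughly how I pulled these out (the pod name, namespace, and time window are from my setup):

    # Grep one of the data pods' logs for GC pressure and circuit-breaker messages
    kubectl logs elastic-es-data-4 -n elastic --since=24h \
      | grep -E "circuit_breaking_exception|CircuitBreakerService|JvmGcMonitorService"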

{"type": "server", "timestamp": "2020-08-27T04:46:28,063Z", "level": "INFO", "component": "o.e.c.m.MetadataIndexTemplateService", "cluster.name": "elastic", "node.name": "elastic-es-master-1", "message": "adding template [filebeat-7.9.0] for index patterns [*-filebeat-*]", "cluster.uuid": "OfB8GyE3S-GoLHQr9se2BA", "node.id": "OrD-B3H0SRyhhlT3lHxE8Q"  }
{"type": "server", "timestamp": "2020-08-27T04:46:25,659Z", "level": "WARN", "component": "o.e.c.m.MetadataIndexTemplateService", "cluster.name": "elastic", "node.name": "elastic-es-master-1", "message": "legacy template [filebeat-7.9.0] has index patterns [*-filebeat-*] matching patterns from existing composable templates [metrics,logs] with patterns (metrics => [metrics-*-*],logs => [logs-*-*]); this template [filebeat-7.9.0] may be ignored in favor of a composable template at index creation time", "cluster.uuid": "OfB8GyE3S-GoLHQr9se2BA", "node.id": "OrD-B3H0SRyhhlT3lHxE8Q"  }
{"type": "server", "timestamp": "2020-08-27T04:46:21,101Z", "level": "WARN", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "elastic", "node.name": "elastic-es-data-4", "message": "[gc][595855] overhead, spent [4.2s] collecting in the last [4.6s]", "cluster.uuid": "OfB8GyE3S-GoLHQr9se2BA", "node.id": "IZ25d43RRmyWhulynsgSGQ"  }
{"type": "server", "timestamp": "2020-08-27T04:46:15,752Z", "level": "INFO", "component": "o.e.i.b.HierarchyCircuitBreakerService", "cluster.name": "elastic", "node.name": "elastic-es-data-4", "message": "attempting to trigger G1GC due to high heap usage [3199937824]", "cluster.uuid": "OfB8GyE3S-GoLHQr9se2BA", "node.id": "IZ25d43RRmyWhulynsgSGQ"  }
{"type": "server", "timestamp": "2020-08-27T04:46:15,771Z", "level": "INFO", "component": "o.e.i.b.HierarchyCircuitBreakerService", "cluster.name": "elastic", "node.name": "elastic-es-data-4", "message": "GC did bring memory usage down, before [3199937824], after [3178531696], allocations [1], duration [19]", "cluster.uuid": "OfB8GyE3S-GoLHQr9se2BA", "node.id": "IZ25d43RRmyWhulynsgSGQ"  }

As a side note: the data-4 pod was not restarted by the rollout, so I triggered its restart manually this morning.
Now that all pods have been restarted, heap usage looks normal again.

Cluster State

  "_nodes" : {
    "total" : 9,
    "successful" : 9,
    "failed" : 0
  },
  "cluster_name" : "elastic",
  "cluster_uuid" : "OfB8GyE3S-GoLHQr9se2BA",
  "timestamp" : 1598504956540,
  "status" : "green",
  "indices" : {
    "count" : 451,
    "shards" : {
      "total" : 2632,
      "primaries" : 1316,
      "replication" : 1.0,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 6,
          "avg" : 5.835920177383592
        },
        "primaries" : {
          "min" : 1,
          "max" : 3,
          "avg" : 2.917960088691796
        },
        "replication" : {
          "min" : 1.0,
          "max" : 1.0,
          "avg" : 1.0
        }
      }
    },

Looking forward to finding the root cause.

That's concerning. How big is your heap?

Given that each node only has a 3GB heap according to the monitoring screenshots above, I think you have far too many shards for the data volume. I would recommend reading this blog post and trying to reduce the shard count dramatically.
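
To get a feel for how small the shards currently are, something like this works (host and credentials are placeholders):

    # Indices sorted by on-disk size, with primary/replica counts
    curl -sk -u elastic:$ELASTIC_PASSWORD "https://localhost:9200/_cat/indices?v=true&h=index,pri,rep,docs.count,store.size&s=store.size:desc"
    # Individual shards, largest first
    curl -sk -u elastic:$ELASTIC_PASSWORD "https://localhost:9200/_cat/shards?v=true&h=index,shard,prirep,store&s=store:desc"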

Our nodes are sized as follows:

elasticMaster:
  storage: 20Gi
  javaHeapSize: 3
  requestMem: 6Gi
  requestCPU: 0.5
  limitMem: 6Gi
  limitCPU: 2
elasticData:
  name: data
  nodes: 6
  storage: 100Gi
  javaHeapSize: 3
  requestMem: 6Gi
  requestCPU: 0.5
  limitMem: 6Gi
  limitCPU: 2

I was running this cluster with far more shards about a month ago. Since then I have switched to weekly indices, which reduced the number of indices and shards dramatically.

What heap size would you recommend for my number of shards?
Maybe I should change to monthly indices :confused:

What is the full output of the cluster stats API?

Based on the monitoring stats it looks like you have an average shard size of less than 100MB, which is very inefficient. I would recommend reducing the number of indices/shards instead of increasing the heap.
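
(If it helps, the full output can be pulled with something like this; host and credentials are placeholders.)

    # Full cluster stats with human-readable sizes
    curl -sk -u elastic:$ELASTIC_PASSWORD "https://localhost:9200/_cluster/stats?human&pretty"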

That's correct.

{
  "_nodes" : {
    "total" : 9,
    "successful" : 9,
    "failed" : 0
  },
  "cluster_name" : "elastic",
  "cluster_uuid" : "OfB8GyE3S-GoLHQr9se2BA",
  "timestamp" : 1598510414082,
  "status" : "green",
  "indices" : {
    "count" : 409,
    "shards" : {
      "total" : 2380,
      "primaries" : 1190,
      "replication" : 1.0,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 6,
          "avg" : 5.819070904645477
        },
        "primaries" : {
          "min" : 1,
          "max" : 3,
          "avg" : 2.9095354523227384
        },
        "replication" : {
          "min" : 1.0,
          "max" : 1.0,
          "avg" : 1.0
        }
      }
    },
    "docs" : {
      "count" : 163460330,
      "deleted" : 1494
    },
    "store" : {
      "size_in_bytes" : 126901367143,
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size_in_bytes" : 0,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 81240,
      "total_count" : 4845,
      "hit_count" : 23,
      "miss_count" : 4822,
      "cache_size" : 58,
      "cache_count" : 69,
      "evictions" : 11
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 17357,
      "memory_in_bytes" : 177868364,
      "terms_memory_in_bytes" : 146435328,
      "stored_fields_memory_in_bytes" : 10583208,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 7800640,
      "points_memory_in_bytes" : 0,
      "doc_values_memory_in_bytes" : 13049188,
      "index_writer_memory_in_bytes" : 272484784,
      "version_map_memory_in_bytes" : 0,
      "fixed_bit_set_memory_in_bytes" : 34198608,
      "max_unsafe_auto_id_timestamp" : 1598504818572,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "alias",
          "count" : 6528,
          "index_count" : 192
        },
        {
          "name" : "binary",
          "count" : 28,
          "index_count" : 7
        },
        {
          "name" : "boolean",
          "count" : 20031,
          "index_count" : 308
        },
        {
          "name" : "date",
          "count" : 20005,
          "index_count" : 408
        },
        {
          "name" : "double",
          "count" : 5296,
          "index_count" : 192
        },
        {
          "name" : "flattened",
          "count" : 262,
          "index_count" : 68
        },
        {
          "name" : "float",
          "count" : 5516,
          "index_count" : 350
        },
        {
          "name" : "geo_point",
          "count" : 1660,
          "index_count" : 254
        },
        {
          "name" : "geo_shape",
          "count" : 5,
          "index_count" : 5
        },
        {
          "name" : "integer",
          "count" : 154,
          "index_count" : 9
        },
        {
          "name" : "ip",
          "count" : 20892,
          "index_count" : 192
        },
        {
          "name" : "keyword",
          "count" : 503208,
          "index_count" : 408
        },
        {
          "name" : "long",
          "count" : 172995,
          "index_count" : 370
        },
        {
          "name" : "nested",
          "count" : 304,
          "index_count" : 203
        },
        {
          "name" : "object",
          "count" : 121685,
          "index_count" : 409
        },
        {
          "name" : "scaled_float",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "short",
          "count" : 19393,
          "index_count" : 193
        },
        {
          "name" : "text",
          "count" : 21924,
          "index_count" : 407
        }
      ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [
        {
          "name" : "pattern_capture",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "analyzer_types" : [
        {
          "name" : "custom",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [
        {
          "name" : "uax_url_email",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "built_in_filters" : [
        {
          "name" : "lowercase",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "unique",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "built_in_analyzers" : [ ]
    }
  },
  "nodes" : {
    "count" : {
      "total" : 9,
      "coordinating_only" : 0,
      "data" : 6,
      "ingest" : 6,
      "master" : 3,
      "ml" : 6,
      "remote_cluster_client" : 9,
      "transform" : 6,
      "voting_only" : 0
    },
    "versions" : [
      "7.9.0"
    ],
    "os" : {
      "available_processors" : 18,
      "allocated_processors" : 18,
      "names" : [
        {
          "name" : "Linux",
          "count" : 9
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "CentOS Linux 7 (Core)",
          "count" : 9
        }
      ],
      "mem" : {
        "total_in_bytes" : 57982058496,
        "free_in_bytes" : 8685391872,
        "used_in_bytes" : 49296666624,
        "free_percent" : 15,
        "used_percent" : 85
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 54
      },
      "open_file_descriptors" : {
        "min" : 509,
        "max" : 3635,
        "avg" : 2552
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 606184692,
      "versions" : [
        {
          "version" : "14.0.1",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "14.0.1+7",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 9
        }
      ],
      "mem" : {
        "heap_used_in_bytes" : 17460825080,
        "heap_max_in_bytes" : 28991029248
      },
      "threads" : 947
    },
    "fs" : {
      "total_in_bytes" : 693923807232,
      "free_in_bytes" : 558180548608,
      "available_in_bytes" : 558029553664
    },
    "plugins" : [
      {
        "name" : "repository-azure",
        "version" : "7.9.0",
        "elasticsearch_version" : "7.9.0",
        "java_version" : "1.8",
        "description" : "The Azure Repository plugin adds support for Azure storage repositories.",
        "classname" : "org.elasticsearch.repositories.azure.AzureRepositoryPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false
      },
      {
        "name" : "repository-s3",
        "version" : "7.9.0",
        "elasticsearch_version" : "7.9.0",
        "java_version" : "1.8",
        "description" : "The S3 repository plugin adds S3 repositories",
        "classname" : "org.elasticsearch.repositories.s3.S3RepositoryPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false
      },
      {
        "name" : "repository-gcs",
        "version" : "7.9.0",
        "elasticsearch_version" : "7.9.0",
        "java_version" : "1.8",
        "description" : "The GCS repository plugin adds Google Cloud Storage support for repositories.",
        "classname" : "org.elasticsearch.repositories.gcs.GoogleCloudStoragePlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false
      }
    ],
    "network_types" : {
      "transport_types" : {
        "security4" : 9
      },
      "http_types" : {
        "security4" : 9
      }
    },
    "discovery_types" : {
      "zen" : 9
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "docker",
        "count" : 9
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 29,
      "processor_stats" : {
        "append" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "conditional" : {
          "count" : 1027800,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 4772
        },
        "convert" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "date" : {
          "count" : 69802,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 321
        },
        "dot_expander" : {
          "count" : 558668,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 161
        },
        "grok" : {
          "count" : 1568667,
          "failed" : 680562,
          "current" : 0,
          "time_in_millis" : 1955
        },
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "json" : {
          "count" : 286078,
          "failed" : 6796,
          "current" : 0,
          "time_in_millis" : 314
        },
        "remove" : {
          "count" : 888112,
          "failed" : 134674,
          "current" : 0,
          "time_in_millis" : 818
        },
        "rename" : {
          "count" : 1656105,
          "failed" : 36,
          "current" : 0,
          "time_in_millis" : 727
        },
        "script" : {
          "count" : 409172,
          "failed" : 134892,
          "current" : 0,
          "time_in_millis" : 3092
        },
        "set" : {
          "count" : 1492570,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 1445
        },
        "split" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        }
      }
    }
  }
}

Bigger-sized indices (screenshot):


Smaller-sized indices (screenshot):

The first thing I would recommend is to switch to a single primary shard per index and to stop splitting by region. If I calculate correctly, that should reduce the shard count by a factor of 6. None of the indices you show require more than a single primary shard.
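
A minimal sketch of how that could look with a composable index template; the template name, index pattern, and priority below are placeholders to adapt to your naming scheme, and if the indices are created by Filebeat the shard count can alternatively be set under setup.template.settings in filebeat.yml:

    # Force a single primary shard (plus one replica) for matching indices
    curl -sk -u elastic:$ELASTIC_PASSWORD -X PUT "https://localhost:9200/_index_template/single-shard-logs" \
      -H 'Content-Type: application/json' -d'
    {
      "index_patterns": ["my-logs-*"],
      "priority": 200,
      "template": {
        "settings": {
          "index.number_of_shards": 1,
          "index.number_of_replicas": 1
        }
      }
    }'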

You could also try adding dedicated ingest nodes so that ingest processing moves off the data nodes. That should lower the heap pressure, though I am not sure how much difference it would make. You could then direct all traffic through these nodes and let them handle query coordination as well as ingest.
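
In ECK that would be an extra nodeSet in the Elasticsearch manifest, roughly along these lines (the nodeSet name, count, heap, and resources are placeholders; 7.9 supports node.roles, and the existing master and data nodeSets stay as they are):

    # Extra nodeSet under spec.nodeSets, alongside the existing master and data nodeSets
    - name: ingest
      count: 2
      config:
        node.roles: ["ingest", "remote_cluster_client"]
      podTemplate:
        spec:
          containers:
          - name: elasticsearch
            env:
            - name: ES_JAVA_OPTS
              value: "-Xms2g -Xmx2g"
            resources:
              requests:
                memory: 4Gi
              limits:
                memory: 4Gi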


Cool, thanks. I will go with this first and check whether the heap usage gets more stable.
