I'm hoping someone can help me understand what is causing this exception. This is being thrown frequently, both while writing to and reading from the cluster.
Here's an example error I received while running GET /_cat/indices?v in Kibana:
{
  "error": {
    "root_cause": [
      {
        "type": "circuit_breaking_exception",
        "reason": "[parent] Data too large, data for [<http_request>] would be [4075745992/3.7gb], which is larger than the limit of [4063657984/3.7gb], real usage: [4075745992/3.7gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=8989/8.7kb, in_flight_requests=0/0b, accounting=1803266/1.7mb]",
        "bytes_wanted": 4075745992,
        "bytes_limit": 4063657984,
        "durability": "PERMANENT"
      }
    ],
    "type": "circuit_breaking_exception",
    "reason": "[parent] Data too large, data for [<http_request>] would be [4075745992/3.7gb], which is larger than the limit of [4063657984/3.7gb], real usage: [4075745992/3.7gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=8989/8.7kb, in_flight_requests=0/0b, accounting=1803266/1.7mb]",
    "bytes_wanted": 4075745992,
    "bytes_limit": 4063657984,
    "durability": "PERMANENT"
  },
  "status": 429
}
My cluster has 3 dedicated master nodes and 3 data nodes. It contains 2 indices (plus the .kibana system indices), and each of my indices has 2 primary shards with the replica count set to 1.
From what I've read, this error means the parent circuit breaker tripped because heap usage reached 95% on at least one node. But when I add up the usage reported by the child breakers (request + fielddata + in_flight_requests + accounting), they never total more than ~20mb, so something else must be responsible for the memory usage.
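For reference, these are the kinds of requests I've been using to add up the breaker and heap numbers per node (my understanding is that on 7.x the parent breaker tracks real heap usage, with indices.breaker.total.limit defaulting to 95% of the heap):

GET /_nodes/stats/breaker,jvm?human

GET /_cat/nodes?v&h=name,node.role,heap.percent,heap.current,heap.max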
I noticed that the cluster was in yellow status, which I narrowed down to an allocation failure while assigning replicas to the units2 index. I removed the replicas, which returned the cluster to green, and I stopped seeing errors for a while. This made me think that replication was using too much memory and causing the issue. To test this theory, I let the cluster run overnight without the replicas. Unfortunately, this morning I found thousands of new CircuitBreakingExceptions.
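For reference, the replicas were removed with an index settings update along these lines:

PUT /units2/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}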
I'm not sure what to look at next and would appreciate any assistance you can provide. For some context, I've run several commands this morning to look for clues. I've copied the output of those commands below.
Here's the output of the GET /_cat/indices?v command:
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open units2 AHcurH6cTASFSj4AF1q7rQ 2 0 14147313 1107187 1.8gb 1.8gb
green open .kibana_2 _Uo50jEPQOK9iBGbh9zw4w 1 1 3.7kb
green open places2 L0T_uvxZR8maIVwO3d44hw 2 1 6356 2570 45.7mb 19.6mb
green open .kibana_1 OousjPfkSHeySiLefOdGOw 1 1 283b
The output of GET /_cat/shards?v:
index shard prirep state docs store ip node
.kibana_2 0 p STARTED 1 3.7kb x.x.x.x 451a15942b572d7159f0736533a7533b
.kibana_2 0 r STARTED 1 3.7kb x.x.x.x 7bcda2e106963bc7c4099a16d057b265
.kibana_1 0 r STARTED 0 283b x.x.x.x 66c9c69b225cb26bb1988e6427d529a8
.kibana_1 0 p STARTED 0 283b x.x.x.x 451a15942b572d7159f0736533a7533b
places2 1 r STARTED 6405 12.8mb x.x.x.x 66c9c69b225cb26bb1988e6427d529a8
places2 1 p STARTED 6405 13.3mb x.x.x.x 451a15942b572d7159f0736533a7533b
places2 0 p STARTED 6356 19.6mb x.x.x.x 66c9c69b225cb26bb1988e6427d529a8
places2 0 r STARTED 6356 13.3mb x.x.x.x 7bcda2e106963bc7c4099a16d057b265
units2 1 p STARTED 14353629 2.1gb x.x.x.x 451a15942b572d7159f0736533a7533b
units2 0 p STARTED 14147313 1.8gb x.x.x.x 7bcda2e106963bc7c4099a16d057b265
The output of GET /_cluster/stats:
{
  "_nodes" : {
    "total" : 6,
    "successful" : 6,
    "failed" : 0
  },
  "cluster_name" : "843863714247:search-00",
  "cluster_uuid" : "z-d6Y0FwRXikLrOSwBlPxg",
  "timestamp" : 1587481456218,
  "status" : "green",
  "indices" : {
    "count" : 4,
    "shards" : {
      "total" : 10,
      "primaries" : 6,
      "replication" : 0.6666666666666666,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 4,
          "avg" : 2.5
        },
        "primaries" : {
          "min" : 1,
          "max" : 2,
          "avg" : 1.5
        },
        "replication" : {
          "min" : 0.0,
          "max" : 1.0,
          "avg" : 0.75
        }
      }
    },
    "docs" : {
      "count" : 28513707,
      "deleted" : 3754420
    },
    "store" : {
      "size_in_bytes" : 4137770567
    },
    "fielddata" : {
      "memory_size_in_bytes" : 18752,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 1109584,
      "total_count" : 3410091,
      "hit_count" : 613985,
      "miss_count" : 2796106,
      "cache_size" : 65,
      "cache_count" : 36427,
      "evictions" : 36362
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 63,
      "memory_in_bytes" : 4344431,
      "terms_memory_in_bytes" : 1655163,
      "stored_fields_memory_in_bytes" : 1085760,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 128576,
      "points_memory_in_bytes" : 867196,
      "doc_values_memory_in_bytes" : 607736,
      "index_writer_memory_in_bytes" : 0,
      "version_map_memory_in_bytes" : 0,
      "fixed_bit_set_memory_in_bytes" : 102640,
      "max_unsafe_auto_id_timestamp" : -1,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 6,
      "coordinating_only" : 0,
      "data" : 3,
      "ingest" : 3,
      "master" : 3
    },
    "versions" : [ "7.4.2" ],
    "os" : {
      "available_processors" : 12,
      "allocated_processors" : 12,
      "names" : [ {
        "count" : 6
      } ],
      "pretty_names" : [ {
        "count" : 6
      } ],
      "mem" : {
        "total_in_bytes" : 35828772864,
        "free_in_bytes" : 4621955072,
        "used_in_bytes" : 31206817792,
        "free_percent" : 13,
        "used_percent" : 87
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 106
      },
      "open_file_descriptors" : {
        "min" : 1403,
        "max" : 1506,
        "avg" : 1445
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 2994403385,
      "mem" : {
        "heap_used_in_bytes" : 13192197672,
        "heap_max_in_bytes" : 19222757376
      },
      "threads" : 759
    },
    "fs" : {
      "total_in_bytes" : 656313581568,
      "free_in_bytes" : 644325363712,
      "available_in_bytes" : 644224700416
    },
    "network_types" : {
      "transport_types" : {
        "com.amazon.opendistroforelasticsearch.security.ssl.http.netty.OpenDistroSecuritySSLNettyTransport" : 6
      },
      "http_types" : {
        "filter-jetty" : 6
      }
    },
    "discovery_types" : {
      "zen" : 6
    },
    "packaging_types" : [ {
      "flavor" : "oss",
      "type" : "tar",
      "count" : 6
    } ]
  }
}