How often does the control loop run that checks for unassigned replica shards and attempts to re-assign them? Is there documentation to this effect? Is there source code I can look at?
Welcome to our community!
What problem are you seeing?
We had a cluster running in yellow status for a few weeks. I understand why, but I can't find documentation on how often the cluster would try to repair this index. In the logs below there appears to be a three-week gap. Is that accurate?
[2022-10-07T08:16:41,167][WARN ][o.e.a.b.TransportShardBulkAction] [hx3gz6o] [[xxxx][0]] failed to perform __PATH__[s] on replica [xxx][0], node[xxx], [R], s[STARTED], a[id=xxx]
RemoteTransportException[[xxx][__IP__][__PATH__[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [__PATH__], which is larger than the limit of [__PATH__], real usage: [__PATH__], new bytes reserved: [__PATH__]];
Caused by: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [__PATH__], which is larger than the limit of [__PATH__], real usage: [__PATH__], new bytes reserved: [__PATH__]]
[2022-10-07T08:16:41,187][INFO ][o.e.c.r.a.AllocationService] [DEY_Qhr] Cluster health status changed from [GREEN] to [YELLOW] (reason: [shards failed [[xxx][0]] ...]).
[2022-10-28T06:18:29,136][WARN ][o.e.a.b.TransportShardBulkAction] [hx3gz6o] [[xxxx][0]] failed to perform __PATH__[s] on replica [xxx][0], node[xxx], [R], s[STARTED], a[id=xxx]
RemoteTransportException[[DEY_Qhr][__IP__][__PATH__[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [__PATH__], which is larger than the limit of [__PATH__], real usage: [__PATH__], new bytes reserved: [__PATH__]];
Caused by: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [__PATH__], which is larger than the limit of [__PATH__], real usage: [__PATH__], new bytes reserved: [__PATH__]]
It should be checking more or less continuously; allocation of unassigned shards is re-evaluated whenever the cluster state changes (node joins or leaves, shard failures, settings updates) rather than on a fixed timer, so a three-week gap is surprising.
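If a replica stays unassigned for a long time, one thing worth checking (an assumption on my part, not something your logs confirm) is whether allocation gave up after repeated failures: by default Elasticsearch stops retrying a shard after index.allocation.max_retries consecutive failed allocation attempts (5) and then waits for a manual retry. A rough sketch of the diagnosis, assuming the cluster is reachable on localhost:9200 without authentication:

# ask the master why the first unassigned shard it finds is unassigned
curl -s 'localhost:9200/_cluster/allocation/explain?pretty'

# if the explanation mentions exhausted retries, ask it to retry the failed allocations
curl -s -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true&pretty'

Both APIs exist in 6.8; only the host and port above are assumed.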
What is the output from the _cluster/stats?pretty&human API? It might give us some further insight.
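For example (again assuming the cluster is reachable on localhost:9200 without authentication):

curl -s 'localhost:9200/_cluster/stats?pretty&human'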
{
  "_nodes" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "cluster_name" : "xxxx",
  "cluster_uuid" : "2mpozck8RqKcMivuvxTdEg",
  "timestamp" : 1667336617127,
  "status" : "yellow",
  "indices" : {
    "count" : 7,
    "shards" : {
      "total" : 29,
      "primaries" : 15,
      "replication" : 0.9333333333333333,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 10,
          "avg" : 4.142857142857143
        },
        "primaries" : {
          "min" : 1,
          "max" : 5,
          "avg" : 2.142857142857143
        },
        "replication" : {
          "min" : 0.0,
          "max" : 1.0,
          "avg" : 0.8571428571428571
        }
      }
    },
    "docs" : {
      "count" : 238719302,
      "deleted" : 91184055
    },
    "store" : {
      "size" : "1.2tb",
      "size_in_bytes" : 1335819766994
    },
    "fielddata" : {
      "memory_size" : "0b",
      "memory_size_in_bytes" : 0,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "86.2mb",
      "memory_size_in_bytes" : 90392008,
      "total_count" : 1269546,
      "hit_count" : 640867,
      "miss_count" : 628679,
      "cache_size" : 11724,
      "cache_count" : 11724,
      "evictions" : 0
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 451,
      "memory" : "87mb",
      "memory_in_bytes" : 91330236,
      "terms_memory" : "4mb",
      "terms_memory_in_bytes" : 4231872,
      "stored_fields_memory" : "45.5mb",
      "stored_fields_memory_in_bytes" : 47808096,
      "term_vectors_memory" : "36.1mb",
      "term_vectors_memory_in_bytes" : 37947600,
      "norms_memory" : "377.1kb",
      "norms_memory_in_bytes" : 386240,
      "points_memory" : "0b",
      "points_memory_in_bytes" : 0,
      "doc_values_memory" : "934kb",
      "doc_values_memory_in_bytes" : 956428,
      "index_writer_memory" : "4.8mb",
      "index_writer_memory_in_bytes" : 5091256,
      "version_map_memory" : "644b",
      "version_map_memory_in_bytes" : 644,
      "fixed_bit_set" : "71.2mb",
      "fixed_bit_set_memory_in_bytes" : 74750944,
      "max_unsafe_auto_id_timestamp" : -1,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 3,
      "data" : 3,
      "coordinating_only" : 0,
      "master" : 3,
      "ingest" : 3
    },
    "versions" : [ "6.8.0" ],
    "os" : {
      "available_processors" : 12,
      "allocated_processors" : 12,
      "names" : [ {
        "count" : 3
      } ],
      "pretty_names" : [ {
        "count" : 3
      } ],
      "mem" : {
        "total" : "45.8gb",
        "total_in_bytes" : 49262174208,
        "free" : "4.4gb",
        "free_in_bytes" : 4803670016,
        "used" : "41.4gb",
        "used_in_bytes" : 44458504192,
        "free_percent" : 10,
        "used_percent" : 90
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 21
      },
      "open_file_descriptors" : {
        "min" : 1584,
        "max" : 1717,
        "avg" : 1628
      }
    },
    "jvm" : {
      "max_uptime" : "54.8d",
      "max_uptime_in_millis" : 4738694465,
      "mem" : {
        "heap_used" : "11.4gb",
        "heap_used_in_bytes" : 12293693848,
        "heap_max" : "23.9gb",
        "heap_max_in_bytes" : 25665208320
      },
      "threads" : 494
    },
    "fs" : {
      "total" : "2.9tb",
      "total_in_bytes" : 3246361178112,
      "free" : "1.1tb",
      "free_in_bytes" : 1273565753344,
      "available" : "1tb",
      "available_in_bytes" : 1108588687360
    },
    "network_types" : {
      "transport_types" : {
        "netty4" : 3
      },
      "http_types" : {
        "filter-jetty" : 3
      }
    }
  }
}
@warkolm any thoughts?
Do you have logs from your master node you could share? The more the better.
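If the nodes were installed from the deb/rpm packages, the logs usually live under /var/log/elasticsearch (an assumption; adjust the path to your install). Something like this would pull out the allocation-related lines:

grep -iE 'allocation|unassigned|yellow|circuitbreaking' /var/log/elasticsearch/*.log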
Please upgrade; this version is EOL.
We are working on an upgrade soon. I ended up rebuilding the cluster since we needed to get out of the yellow zone. Thanks for the help.