Hi,
I opened this bug https://github.com/elastic/elasticsearch/issues/72853, but the Elastic team did not consider it a bug. My situation is quite similar to this other Kibana ticket https://github.com/elastic/kibana/issues/84041. However, we stopped Kibana and the symptoms persist, so it looks to me like an ES 7.11 problem. The situation is as follows.
We have 2 data nodes (Standard_L8s_v2) plus a voting-only node to break ties in master elections. We have a small number of indices and shards. The data nodes end up with around 32 GB of heap out of 64 GB of system RAM, and JVM heap pressure is low (~25%). Both data nodes are in the same virtual network and region, and the connection between them seems to be working fine.
{
"cluster_name" : "elastic7",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 3,
"number_of_data_nodes" : 2,
"active_primary_shards" : 36,
"active_shards" : 72,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}
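For reference, the node roles and heap sizing described above can be checked with something like this (host and port are placeholders for our environment):

curl -s 'http://localhost:9200/_cat/nodes?v&h=name,master,node.role,heap.percent,heap.max,ram.max'

In our case this just reflects what is described above: a dedicated voting-only tie-breaker and roughly 32 GB of heap on each data node.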
But from time to time we observe timeouts on cluster management requests, leading to situations where a node is removed from the cluster or even appears frozen. Example log entries:
[2021-05-07T15:24:45,544][WARN ][o.e.t.InboundHandler ] [es_data_1] handling inbound transport message [InboundMessage{Header{2659950}{7.11.2}{362663}{false}{false}{false}{false}{NO_ACTION_NAME_FOR_RESPONSES}}] took [36905ms] which is above the warn threshold of [5000ms]
While looking into that, I found this confirmed bug https://github.com/elastic/elasticsearch/issues/65405; however, it should already be fixed in 7.11.2.
We even tried increasing the timeout settings, but after some time we still see entries like this in the log (see the note on the relevant setting just after these entries):
[2021-05-13T06:03:46,254][WARN ][o.e.c.InternalClusterInfoService] [es_data_1] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
[2021-05-13T06:05:31,259][WARN ][o.e.c.InternalClusterInfoService] [es_data_1] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
[2021-05-13T06:07:16,264][WARN ][o.e.c.InternalClusterInfoService] [es_data_1] Failed to update node information for ClusterInfoUpdateJob within 15s timeout
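For what it's worth, the 15s in those messages is the default of cluster.info.update.timeout, which can be raised dynamically with something like the following (the 60s value is only an example, and raising it obviously does not address whatever is making those updates slow in the first place):

# raise the ClusterInfoService poll timeout from its 15s default (example value)
curl -s -X PUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster.info.update.timeout": "60s"
  }
}'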
In the end, the node freezes. We checked GC behaviour, but the GC logs do not show any abnormal pause times. We also enabled the slow log to check which queries were running around those moments (settings sketch below); some queries do get recorded during those episodes, but the same queries are not "slow" at any other time. This happens quite often, so it is not an isolated incident. If we restart the Elasticsearch service on the faulty node, the situation goes back to normal, so it does not look like a hardware issue. We even tried redeploying the whole cluster, and the problem persists.
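For completeness, the slow log mentioned above was enabled roughly like this (the index name and thresholds here are placeholders; the exact values we used differ):

curl -s -X PUT 'http://localhost:9200/my-index/_settings' -H 'Content-Type: application/json' -d '
{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.query.info": "2s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}'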
Any help would be really appreciated.