Timeouts in cluster management requests on ES 7.11.2 leading to nodes in the cluster freezing

Hi,

I opened this bug report https://github.com/elastic/elasticsearch/issues/72853, but the Elastic team did not consider it a bug. My situation is quite similar to this other Kibana ticket https://github.com/elastic/kibana/issues/84041; however, we stopped Kibana and the symptoms persist, so it seems to me that it is an ES 7.11 problem. The situation is as follows.

We have 2 data nodes (Standard_L8s_v2) with a voting-only node for breaking ties in master elections. We have a small number of indices and shards. The data nodes are configured with around 32 GB of heap out of 64 GB of system RAM, and JVM heap pressure is low (~25%). Both nodes are in the same virtual network and region, and the connection between them seems to be working fine.

{
"cluster_name" : "elastic7",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 3,
"number_of_data_nodes" : 2,
"active_primary_shards" : 36,
"active_shards" : 72,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}

But from time to time, we observe timeouts on cluster management requests, leading to situations where a node is removed from the cluster or even appears frozen. Example log entries:

[2021-05-07T15:24:45,544][WARN ][o.e.t.InboundHandler ] [es_data_1] handling inbound transport message [InboundMessage{Header{2659950}{7.11.2}{362663}{false}{false}{false}{false}{NO_ACTION_NAME_FOR_RESPONSES}}] took [36905ms] which is above the warn threshold of [5000ms]

While searching for that message, I found this confirmed bug https://github.com/elastic/elasticsearch/issues/65405. However, it should already be fixed in 7.11.2.

We even tried increasing the timeout settings, but after some time we observe entries like this in the log:

[2021-05-13T06:03:46,254][WARN ][o.e.c.InternalClusterInfoService] [es_data_1] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
[2021-05-13T06:05:31,259][WARN ][o.e.c.InternalClusterInfoService] [es_data_1] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
[2021-05-13T06:07:16,264][WARN ][o.e.c.InternalClusterInfoService] [es_data_1] Failed to update node information for ClusterInfoUpdateJob within 15s timeout

And finally, the node freezes. We checked GC behaviour, but the GC logs do not show any abnormal GC times. We enabled the slow log to check for queries around those moments; some queries are indeed recorded during those episodes, but the same queries are not slow at any other time. This happens quite often, so it is not an isolated incident. If we restart the faulty node's Elasticsearch service, the situation returns to normal, so it does not seem to be a hardware issue. We even tried redeploying the whole cluster, and the problem persists.

Any help will be really appreciated.

Could you capture stack dumps from the stuck node while it's stuck? Either enable JFR and grab a flight recording, or just run jstack every few seconds.
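
For example, something along these lines (the PID placeholder and file names are just illustrative):

# take a thread dump every few seconds while the node is stuck
for i in $(seq 1 20); do jstack -l <es-pid> > jstack-$i.txt; sleep 5; done

# or start a flight recording and dump it once the episode is over
jcmd <es-pid> JFR.start name=stuck filename=/tmp/stuck-node.jfr
jcmd <es-pid> JFR.stop name=stuck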

Hi David!

Thank you for your answer. I did some jstack dumps in the past:

Are they enough?

Yes, that explains it:

com.linknovate.elasticsearch.TransportSmartFeedAction.getTermVectors(TransportSmartFeedAction.java:418)

You're using a plugin with a bad bug: it's making blocking calls on transport threads.


Hi again,

Thank you for the tip. That line is simply calling prepareSearch on the client to run a query against the index.

SearchResponse searchResponse = client.prepareSearch(INDEX).setQuery(queryBuilder).setSize(docIds.length).get();
Is that not the correct way of running it? We assumed the node would answer the request and the thread would then be freed. Maybe the blocked request is a side effect of the node being frozen rather than the cause? We will examine it anyway.

Thank you again!!

No, there's definitely blocking happening on this thread, which would explain your symptoms. Are you the author of this plugin? If so, run your tests with the JVM option -ea; you should find you're violating some assertions here.
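
To illustrate the difference, here's a rough sketch of the non-blocking way to issue that same search, reusing the variable names from your snippet (ActionListener is org.elasticsearch.action.ActionListener):

// Sketch only: pass a listener instead of blocking on get(), so the calling
// transport thread is not parked waiting for the response.
client.prepareSearch(INDEX)
    .setQuery(queryBuilder)
    .setSize(docIds.length)
    .execute(ActionListener.wrap(
        searchResponse -> {
            // continue processing here once the response arrives
        },
        e -> {
            // handle the failure that get() would otherwise rethrow
        }));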


We replaced the plugin, and it seems to be working now. Thank you a lot for your help! However, the -ea flag did not spot any assertion violation. We were issuing an additional search request from the plugin itself; we tested with both sync and async search requests, and both ended up with the blocking behaviour. This was not happening on ES 6. We worked around it by removing that request from the plugin, and now it is solved. Thank you again for pointing us down the right path.

That's good news, thanks for letting us know.

The assertion I expected to be failing was this one:

By default this will fail on a transport_worker thread as seen in the stack dumps you shared, but it is of course possible the plugin is overriding blockingAllowed() to claim that the blocking it's doing causes no harm. I couldn't find the plugin source to verify that possibility.
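
Purely to illustrate that possibility (this is not your plugin's code; the class name is hypothetical, and it assumes the 7.x hierarchy where PlainActionFuture inherits BaseFuture#blockingAllowed()):

import org.elasticsearch.action.support.PlainActionFuture;

// Hypothetical sketch: a future whose blockingAllowed() override claims the
// blocking is harmless, so the assertion never trips even under -ea on a
// transport_worker thread.
class PermissiveFuture<T> extends PlainActionFuture<T> {
    @Override
    protected boolean blockingAllowed() {
        return true; // defeats the transport-thread check
    }
}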
