Hi,
I opened this bug https://github.com/elastic/elasticsearch/issues/72853, but the Elastic team did not consider it a bug. My situation is quite similar to this other Kibana ticket https://github.com/elastic/kibana/issues/84041. However, we stopped Kibana and the symptoms persist, so it looks to me like an ES 7.11 problem. The situation is as follows.
We have 2 data nodes (Standard_L8s_v2) plus a voting-only node to break ties in master elections. We have a small number of indices and shards. The data nodes end up with around 32 GB of heap out of 64 GB of system RAM, and JVM heap pressure is low (~25%). Both data nodes are in the same virtual network and region, and the connection between them seems to be working fine.
{
"cluster_name" : "elastic7",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 3,
"number_of_data_nodes" : 2,
"active_primary_shards" : 36,
"active_shards" : 72,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}
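For reference, the node roles and heap sizing described above can be checked with something like this (host and port are placeholders for our environment):

curl -s 'http://localhost:9200/_cat/nodes?v&h=name,master,node.role,heap.percent,heap.max,ram.max'

In our case this just reflects what is described above: a dedicated voting-only tie-breaker and roughly 32 GB of heap on each data node.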
But from time to time we observe timeouts on cluster management requests, leading to situations where a node is removed from the cluster or even appears frozen. Example log entries:
[2021-05-07T15:24:45,544][WARN ][o.e.t.InboundHandler ] [es_data_1] handling inbound transport message [InboundMessage{Header{2659950}{7.11.2}{362663}{false}{false}{false}{false}{NO_ACTION_NAME_FOR_RESPONSES}}] took [36905ms] which is above the warn threshold of [5000ms]
While looking into that, I found this confirmed bug https://github.com/elastic/elasticsearch/issues/65405; however, it should already be fixed in 7.11.2.
We even tried increasing the timeout settings, but after some time we still see entries like this in the log (see the note on the relevant setting just after these entries):
[2021-05-13T06:03:46,254][WARN ][o.e.c.InternalClusterInfoService] [es_data_1] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
[2021-05-13T06:05:31,259][WARN ][o.e.c.InternalClusterInfoService] [es_data_1] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
[2021-05-13T06:07:16,264][WARN ][o.e.c.InternalClusterInfoService] [es_data_1] Failed to update node information for ClusterInfoUpdateJob within 15s timeout
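For what it's worth, the 15s in those messages is the default of cluster.info.update.timeout, which can be raised dynamically with something like the following (the 60s value is only an example, and raising it obviously does not address whatever is making those updates slow in the first place):

# raise the ClusterInfoService poll timeout from its 15s default (example value)
curl -s -X PUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster.info.update.timeout": "60s"
  }
}'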
In the end, the node freezes. We checked GC behaviour, but the GC logs do not show any abnormal pause times. We also enabled the slow log to check which queries were running around those moments (settings sketch below); some queries do get recorded during those episodes, but the same queries are not "slow" at any other time. This happens quite often, so it is not an isolated incident. If we restart the Elasticsearch service on the faulty node, the situation goes back to normal, so it does not look like a hardware issue. We even tried redeploying the whole cluster, and the problem persists.
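For completeness, the slow log mentioned above was enabled roughly like this (the index name and thresholds here are placeholders; the exact values we used differ):

curl -s -X PUT 'http://localhost:9200/my-index/_settings' -H 'Content-Type: application/json' -d '
{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.query.info": "2s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}'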
Any help would be really appreciated.