Shard Recovery takes long due to engine is closed

german23 · March 10, 2017, 7:53am

Hey guys,

First off, we are running ES 2.4.2.

The last few weeks we encountered a nasty behaviour of ES in regards to share recovery.

When i restart a data node and the shard recovery process beginns i see the following "informations" in the debug log:

[2017-03-09 07:33:16,701][DEBUG][index.shard              ] [node] [logstash-2017.02.23][11] updateBufferSize: engine is closed; skipping
[2017-03-09 07:33:16,701][DEBUG][index.shard              ] [node] [logstash-2017.02.23][3] updateBufferSize: engine is closed; skipping
[2017-03-09 07:33:16,701][DEBUG][index.shard              ] [node] [logstash-2017.02.23][4] updateBufferSize: engine is closed; skipping
[2017-03-09 07:33:16,701][DEBUG][index.shard              ] [node] [logstash-2017.02.23][7] updateBufferSize: engine is closed; skipping
[2017-03-09 07:33:16,701][DEBUG][index.shard              ] [node] [logstash-2017.02.22][4] updateBufferSize: engine is closed; skipping
[2017-03-09 07:33:16,701][DEBUG][index.shard              ] [node] [logstash-2017.02.22][17] updateBufferSize: engine is closed; skipping

While this is no error in particular, as long as this messages appear that mentioned shard dont get recovered, cause of this the recovery of 40 shards can take up to 1 hour (regardless of the disk performance etc.)

The node have fairly high cpu load while recovering the indices, could this lead into some sort of "timeout" for shard recovery leading into this message?

Any feedback is appreciated.

Regards

gndcshv · March 21, 2017, 10:27am

Hi @german23,

I am observing a similar problems in ES 2.4.1 and 2.4.4. I see exactly the same log messages and cluster behaves in the same way.

When these issues are happening I also see a build up of unanswered tasks on certain nodes in the cluster - mostly cluster:monitor/nodes/info - I'm using ".../_tasks" REST request to monitor them.

Do you see anything like that? Have you found a solution by any chance?

Please let me know.

Thank you!

german23 · March 21, 2017, 12:09pm

HI @gndcshv,

I did not really find a solution to this problem but some sort of "workaround".

I set "node_initial_primaries_recoveries" to 10 on the cluster level, after that the initial recovery process seems to get much faster (even tho kopf-plugin still shows all shards as recovering after node restart) and the messages occur less often.

gndcshv · March 21, 2017, 4:52pm

Thanks! Do you recall if you had similar issue with node tasks getting accumulated over time?

german23 · March 22, 2017, 6:16am

Well what i have noticed is that "cluster state updates" getting processed very slowly if this behaviours occured.

Tbh in my opinion the problem of me was the high cpu load on the server as he tried to recover all 113 shards that are located on it. Since we are using a disk compression i think there was too much cpu context swapping going on, with only 10 shard recovering at a time now it (theoretically) is much smoother.

I dunno abour your server, your filesystem etc but look if there is a performance bottleneck (either cpu or disk) while this messages occur

gndcshv · March 23, 2017, 8:05am

@german23 in our case I triggered this behavior by adding several nodes to the cluster and thus making ES to rebalance shards. I could see short-term bursts of load avg and CPU when that happens, and slightly higher load on the nodes involved in rebalancing activity. I don't see any native transport related errors or exceptions in the logs even though I enabled transport tracer for everything.

This cluster runs without index compression, tcp compression, or disk compression - I do have encryption enabled for the native transport protocol though, so that might be our bottleneck.

system · April 20, 2017, 8:05am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
900000ms timeout during shard recovery Elasticsearch	6	3256	July 5, 2017
Shard re-allocation taking a very long time Elasticsearch	16	7848	April 15, 2019
How to avoid/lighten shard recovery after restart? Elasticsearch	2	455	July 6, 2017
A ruined shard caused ES node down Elasticsearch	10	914	December 22, 2017
Is it me or is ES 1.6.0 node startup/recovery slower then before? Elasticsearch	15	1078	July 6, 2017

Shard Recovery takes long due to engine is closed

Related topics