Shard recovery takes a long time due to "engine is closed"

Hey guys,

First off, we are running ES 2.4.2.

Over the last few weeks we have encountered some nasty behaviour of ES with regard to shard recovery.

When I restart a data node and the shard recovery process begins, I see the following messages in the debug log:

[2017-03-09 07:33:16,701][DEBUG][index.shard              ] [node] [logstash-2017.02.23][11] updateBufferSize: engine is closed; skipping
[2017-03-09 07:33:16,701][DEBUG][index.shard              ] [node] [logstash-2017.02.23][3] updateBufferSize: engine is closed; skipping
[2017-03-09 07:33:16,701][DEBUG][index.shard              ] [node] [logstash-2017.02.23][4] updateBufferSize: engine is closed; skipping
[2017-03-09 07:33:16,701][DEBUG][index.shard              ] [node] [logstash-2017.02.23][7] updateBufferSize: engine is closed; skipping
[2017-03-09 07:33:16,701][DEBUG][index.shard              ] [node] [logstash-2017.02.22][4] updateBufferSize: engine is closed; skipping
[2017-03-09 07:33:16,701][DEBUG][index.shard              ] [node] [logstash-2017.02.22][17] updateBufferSize: engine is closed; skipping

While this is not an error as such, as long as these messages appear the mentioned shards don't get recovered. Because of this, the recovery of 40 shards can take up to 1 hour (regardless of disk performance etc.).
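For reference, this is roughly how I keep an eye on the recovery progress while these messages keep appearing (just the standard cat recovery and cluster health endpoints; adjust host/port for your setup):

```
# List recoveries per shard with their stage and progress
curl -s 'localhost:9200/_cat/recovery?v'

# Cluster health, including how many shards are still initializing/unassigned
curl -s 'localhost:9200/_cluster/health?pretty'
```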

The node has a fairly high CPU load while recovering the indices. Could this lead to some sort of "timeout" for shard recovery, which then produces these messages?

Any feedback is appreciated.

Regards

Hi @german23,

I am observing a similar problem on ES 2.4.1 and 2.4.4. I see exactly the same log messages, and the cluster behaves in the same way.

When these issues are happening I also see a build-up of unanswered tasks on certain nodes in the cluster - mostly cluster:monitor/nodes/info - I'm using the ".../_tasks" REST request to monitor them.
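In case it is useful, this is roughly what I run to watch those tasks (the 2.x task management API as I understand it; the actions filter is just an example):

```
# List tasks currently executing on the nodes
curl -s 'localhost:9200/_tasks?pretty'

# Only the node info calls that seem to pile up
curl -s 'localhost:9200/_tasks?pretty&actions=cluster:monitor/nodes/info*'
```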

Do you see anything like that? Have you found a solution by any chance?

Please let me know.

Thank you!

Hi @gndcshv,

I did not really find a solution to this problem, but I do have some sort of workaround.

I set "node_initial_primaries_recoveries" to 10 on the cluster level, after that the initial recovery process seems to get much faster (even tho kopf-plugin still shows all shards as recovering after node restart) and the messages occur less often.

Thanks! Do you recall whether you had a similar issue with node tasks accumulating over time?

Well, what I have noticed is that cluster state updates get processed very slowly when this behaviour occurs.
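If you want to check that on your side, the queued cluster state updates can be seen with the pending tasks endpoint, something like:

```
# Cluster state update tasks still queued on the master;
# a long or old queue means the master is falling behind
curl -s 'localhost:9200/_cluster/pending_tasks?pretty'
```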

To be honest, in my opinion my problem was the high CPU load on the server while it tried to recover all 113 shards located on it. Since we are using disk compression, I think there was too much CPU context switching going on; with only 10 shards recovering at a time it is now (theoretically) much smoother.

I don't know about your server, your filesystem etc., but check whether there is a performance bottleneck (either CPU or disk) while these messages occur.
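What I usually look at for that (just the standard hot threads and node stats APIs, nothing fancy):

```
# What the busiest threads on each node are doing (CPU bottleneck)
curl -s 'localhost:9200/_nodes/hot_threads'

# Per-node OS, filesystem and thread pool statistics (disk / queue bottleneck)
curl -s 'localhost:9200/_nodes/stats/os,fs,thread_pool?pretty'
```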

@german23 in our case I triggered this behaviour by adding several nodes to the cluster, which made ES rebalance shards. I could see short-term bursts of load average and CPU when that happened, and slightly higher load on the nodes involved in the rebalancing activity. I don't see any native-transport-related errors or exceptions in the logs, even though I enabled the transport tracer for everything.

This cluster runs without index compression, TCP compression, or disk compression - I do have encryption enabled for the native transport protocol though, so that might be our bottleneck.
