I'm experiencing a problem I've never seen before during index recovery. After the loss of a server that was running both a master and data node (the kind of failure I've seen Elasticsearch clustering handle exceptionally well in the past), the cluster seems to have gotten stuck in the recovery process:
Initially the cluster was red, with shard recovery stuck (I gave it several hours to see some progress) on a series of indices that hold no documents (indices where I send docs with parsing failures, typically grok parsing failures).
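In case it helps diagnose, this is the kind of request I used to check why a shard stays stuck (the index name and shard number below are just placeholders for one of my failed-docs indices):

```
GET _cluster/allocation/explain
{
  "index": "my-failed-docs-index",
  "shard": 0,
  "primary": true
}
```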
Files and Bytes were both done... it was the translog that was stuck, showing n/a.
Since data ingest for the whole cluster was completely stalled, I deleted those empty indices; health went yellow and ingest seemed to resume... but the problem continues.
Once yellow, with unassigned shards, data ingest resumed... but when looking at shard activity, I see the indices doing peer recovery with Files and Bytes at 100%, but again the translog is either stuck at 0% (while showing X/X complete, i.e. 100%) or showing 'n/a'.
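For reference, I'm watching recovery progress with the cat recovery API, something like this (column list trimmed to the fields I care about):

```
GET _cat/recovery?v&active_only=true&h=index,shard,stage,files_percent,bytes_percent,translog_ops_percent
```

This is where I see `files_percent` and `bytes_percent` at 100% while `translog_ops_percent` stays at 0% or n/a.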
The process seems to time out or something, because the indices appear to repeat this state in a kind of try/fail/retry loop.
Searching the forums for this kind of translog issue, I've found similar cases, sometimes bug/issue related, in versions 2.x, 5.x and 6.x... so maybe this could be yet another such case.
Any clue why this could happen? How could I remedy it? Or am I missing some documentation on fine-tuning my setup to prevent this issue?
EDIT: It seems the problem could be related to the server going down in the middle of a scheduled snapshot.
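To check that theory, I looked for an in-flight snapshot with the snapshot status API; if one is hanging, my understanding is that an in-progress snapshot can be aborted by deleting it (repository and snapshot names below are placeholders for my setup):

```
GET _snapshot/_status

DELETE _snapshot/my_backup_repo/my_stuck_snapshot
```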