I'm experiencing a problem I've never seen before during index recovery. After the loss of a server that was running both a master and data node (the kind of failure I've seen Elasticsearch clustering handle exceptionally well in the past), the cluster seems to have gotten stuck in the recovery process:
Initially the cluster was red, with shard recovery stuck (I gave it several hours to see some progress) on a series of indices that hold no documents (indices where I send docs with parsing failures, typically grok parsing failures).
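In case it helps diagnose, this is the kind of request I used to check why a shard stays stuck (the index name and shard number below are just placeholders for one of my failed-docs indices):

```
GET _cluster/allocation/explain
{
  "index": "my-failed-docs-index",
  "shard": 0,
  "primary": true
}
```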
Files and Bytes were both done... it was the translog that was stuck, showing n/a.
Since data ingest for the whole cluster was completely stalled, I deleted those empty indices; health went yellow and ingest seemed to resume... but the problem continues.
Once yellow, with unassigned shards, data ingest resumed... but when looking at shard activity, I see the indices doing peer recovery with Files and Bytes at 100%, but again the translog is either stuck at 0% (while showing X/X complete, i.e. 100%) or showing 'n/a'.
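For reference, I'm watching recovery progress with the cat recovery API, something like this (column list trimmed to the fields I care about):

```
GET _cat/recovery?v&active_only=true&h=index,shard,stage,files_percent,bytes_percent,translog_ops_percent
```

This is where I see `files_percent` and `bytes_percent` at 100% while `translog_ops_percent` stays at 0% or n/a.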
The process seems to time out or something, because the indices appear to repeat this state in a kind of try/fail/retry loop.
Searching the forums for this kind of translog issue, I've found similar cases, sometimes bug/issue related, in versions 2.x, 5.x and 6.x... so maybe this could be yet another such case.
Any clue why this could happen? How could I remedy it? Or am I missing some documentation on fine-tuning my setup to prevent this issue?
EDIT: It seems the problem could be related to the server going down in the middle of a scheduled snapshot.
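To check that theory, I looked for an in-flight snapshot with the snapshot status API; if one is hanging, my understanding is that an in-progress snapshot can be aborted by deleting it (repository and snapshot names below are placeholders for my setup):

```
GET _snapshot/_status

DELETE _snapshot/my_backup_repo/my_stuck_snapshot
```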