Detecting Data Loss

Last night we lost 3 nodes in our cluster. From the logs I have, it looks like they died sequentially over the course of about 2 hours. We don't yet know why they became unresponsive. I was able to SSH into the machines this morning, but was unable to restart the elasticsearch service (the nodes were no longer seen as part of the cluster) or even run ps -aux.
The nodes all have ephemeral disks (Google Cloud). When I rebooted the nodes the amount of free space on each of the disks was substantially higher than before, but the disks were not blank, and I didn't stop the node, so I would have expected the disks to remain intact.

What I'd like help with is understanding how I can tell whether we lost data as a result. I could restore the indices from our backups, but I'd like to know if there is a more straightforward way to tell if shards / segments were deleted and data was lost (without having to remember how many docs I should have in each of my indices).

The easiest way would be to have an X-Pack basic license with Monitoring shipping to a secondary cluster; then you could just look at the stats. Without that, you are flying blind.
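For anyone without Monitoring, a rough stand-in is to record per-index doc counts on a schedule and compare them after an incident. The sketch below is only an illustration: it assumes the cluster is reachable over plain HTTP without authentication, and the function names are made up for this example. It reads counts from the `_cat/indices` API and flags indices that disappeared or shrank. A lower count can also come from ordinary deletes, so treat a hit as a prompt to look closer, not as proof of loss.

```python
import json
import urllib.request


def fetch_doc_counts(base_url):
    """Return {index_name: doc_count} using the _cat/indices API.

    Assumes an unauthenticated HTTP endpoint, e.g. http://localhost:9200.
    """
    url = base_url.rstrip("/") + "/_cat/indices?format=json&h=index,docs.count"
    with urllib.request.urlopen(url) as resp:
        rows = json.load(resp)
    # docs.count can be null (e.g. for closed indices); treat that as 0.
    return {row["index"]: int(row.get("docs.count") or 0) for row in rows}


def find_shrunken_indices(before, after):
    """Return {index: (old_count, new_count)} for indices that vanished
    (new_count is None) or now hold fewer docs than in the earlier snapshot."""
    suspects = {}
    for index, old in before.items():
        new = after.get(index)
        if new is None or new < old:
            suspects[index] = (old, new)
    return suspects
```

In practice you would save the result of `fetch_doc_counts` somewhere off the cluster (a JSON file is enough) and pass the last saved snapshot as `before` once the nodes are back.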

I'm saving logs off the box and aggregating them. What log line am I looking for?

We don't log this sort of thing.

What about if, during startup, a shard that was previously on the node is missing or corrupted? Any sort of indication that there were shenanigans at the FS level beneath elastic?

Then it will log that.
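Besides the startup logs, the cluster itself can be asked which shards failed to come back. This is a minimal sketch, again assuming an unauthenticated HTTP endpoint and using hypothetical function names: it lists shard copies that `_cat/shards` reports as `UNASSIGNED`. An unassigned replica usually just means the cluster is still recovering; an unassigned primary is the case worth comparing against your backups.

```python
import json
import urllib.request


def fetch_shard_rows(base_url):
    """Return the rows reported by the _cat/shards API as a list of dicts.

    Assumes an unauthenticated HTTP endpoint, e.g. http://localhost:9200.
    """
    url = base_url.rstrip("/") + "/_cat/shards?format=json&h=index,shard,prirep,state"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def unassigned_shards(shard_rows):
    """Return (index, shard, prirep) tuples for shard copies in state UNASSIGNED.

    prirep is "p" for a primary and "r" for a replica.
    """
    return [
        (row["index"], row["shard"], row["prirep"])
        for row in shard_rows
        if row.get("state") == "UNASSIGNED"
    ]
```

Calling `unassigned_shards(fetch_shard_rows("http://localhost:9200"))` after the nodes rejoin gives a quick list to check before deciding whether a restore is needed.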
