Detecting Data Loss


(Jeff Bolle) #1

Last night we lost 3 nodes in our cluster. From the logs I have, it looks like they died one after another over the course of about 2 hours. Why they became unresponsive is not yet known. I was able to SSH into the machines this morning, but I could not restart the Elasticsearch service (the nodes were no longer seen as part of the cluster) or even run ps -aux.
The nodes all have ephemeral disks (Google Cloud). After I rebooted the nodes, the amount of free space on each disk was substantially higher than before. The disks were not blank, though, and I only rebooted the nodes rather than stopping them, so I would have expected the disks to remain intact.

What I'd like help with is understanding how I can tell whether we lost data under these conditions. I could restore the indices from our backups, but I'd like to know if there is a more straightforward way to tell whether shards / segments were deleted and data was lost (without having to remember how many docs I should have in each of my indices).


(Mark Walkom) #2

The easiest way would be to have an X-Pack basic license with Monitoring shipping to a secondary cluster; then you could just look at the stats from before and after the incident. Without that, you are flying blind.
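
To illustrate what "looking at the stats" amounts to, here is a minimal sketch that compares per-index document counts between two saved copies of the `GET _cat/indices?format=json` output. It assumes you had captured that output somewhere off the cluster before the incident; the function names `doc_counts` and `diff_counts` are hypothetical helpers, not part of any Elasticsearch client.

```python
import json


def doc_counts(cat_indices_json):
    """Map index name -> live doc count from `_cat/indices?format=json` output.

    The cat API returns numbers as strings, and `docs.count` can be null
    (for example for a closed index), so both cases are handled here.
    """
    counts = {}
    for row in json.loads(cat_indices_json):
        raw = row.get("docs.count")
        counts[row["index"]] = int(raw) if raw is not None else None
    return counts


def diff_counts(before, after):
    """Report indices that disappeared or now hold fewer documents."""
    report = {}
    for index, old in before.items():
        if index not in after:
            report[index] = "missing"
            continue
        new = after[index]
        if old is not None and new is not None and new < old:
            report[index] = f"lost {old - new} docs"
    return report


if __name__ == "__main__":
    # Made-up sample data standing in for the two saved snapshots.
    before = doc_counts('[{"index": "logs", "docs.count": "100"}]')
    after = doc_counts('[{"index": "logs", "docs.count": "60"}]')
    print(diff_counts(before, after))
```

A lower count is only a hint: `docs.count` drops for legitimate deletes as well, so this comparison is most useful for indices that are append-only.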


(Jeff Bolle) #3

I'm saving logs off the box and aggregating them. What log line am I looking for?


(Mark Walkom) #4

We don't log this sort of thing.


(Jeff Bolle) #5

What about if, during startup, a shard that was previously on the node is missing or corrupted? Is there any sort of indication that there were shenanigans at the FS level beneath Elasticsearch?


(Mark Walkom) #6

Then it will log that.
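
Besides the logs, a primary shard with no usable copy left should show up as unassigned in the cluster state. As a rough sketch, assuming you have saved the output of `GET _cat/shards?format=json` to a string, something like the following lists those primaries. The helper name `unassigned_primaries` is made up for illustration.

```python
import json


def unassigned_primaries(cat_shards_json):
    """Return sorted (index, shard number) pairs for unassigned primary shards.

    Expects the JSON body of `_cat/shards?format=json`, where each row has
    `index`, `shard`, `prirep` ("p" or "r") and `state` columns.
    """
    rows = json.loads(cat_shards_json)
    return sorted(
        (row["index"], int(row["shard"]))
        for row in rows
        if row["prirep"] == "p" and row["state"] == "UNASSIGNED"
    )


if __name__ == "__main__":
    # Made-up sample rows standing in for real cat API output.
    sample = json.dumps([
        {"index": "logs", "shard": "0", "prirep": "p", "state": "STARTED"},
        {"index": "logs", "shard": "1", "prirep": "p", "state": "UNASSIGNED"},
    ])
    print(unassigned_primaries(sample))
```

An empty result does not prove that nothing was lost; it only means every primary currently has an assigned copy.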

