We have a 4-node cluster running ES 0.14.2. Last night our operations
team needed to reboot our core switches to address an issue they were
seeing (not affecting ES). The switches were rebooted 10 minutes
apart. After the second reboot, we saw a cluster disconnect. Then 5
minutes later, we saw the disconnected node itself begin sending node
disconnects.
Once a node is disconnected, shouldn't the other nodes ignore it?
More troubling, two of the nodes ended up deleting nearly all of
their index contents. Ultimately we lost ~45% of the index contents,
which we've since been able to reapply. We're close to being back
operational, but we want to understand why those nodes chose to throw
away index data.
Timeline (all times UTC; our machine names are abbreviated here to
101-104):
~02:05 core switch reboot
~02:15 2nd core switch reboot
02:16 101 node unable to ping 104 node, declares itself master
02:21:47 101 node unable to ping 103, 102, removes them from cluster
At this point, per our disk utilization recording, nodes 102 and 103
drop index data
02:21:55 101 node sees all 4 nodes and adds them back to cluster
~03:00 I got a call from our NOC and started looking. At this point
both isolated clusters were in a red state. I stopped ES on the
isolated node and then restarted it, hoping the cluster would then
have enough shards for ES to get out of the red state. For an hour it
tried to start the remaining 4 shards, but couldn't. (A sketch of how
we poll the health state is below, after the timeline.)
04:04 I shut the cluster down and restarted with all 4 nodes. The
cluster came up into yellow and eventually green state, but we had
lost a large number of documents. Several large indexes were empty;
several had only 1 or 2 shards' worth of data.
At this point I tried another restart, but the data was gone (I was
grasping at straws here). I kicked off our tools to repopulate the
missing indexes.
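
For anyone wondering how we watch the red/yellow/green state: this is
a minimal sketch against the 0.x Java API (class and method names as
I remember them, so treat it as approximate; the host is a
placeholder for one of our nodes):

import org.elasticsearch.action.admin.cluster.health.ClusterHealthRequest;
import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.action.admin.cluster.health.ClusterHealthStatus;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class HealthWatch {
    public static void main(String[] args) throws InterruptedException {
        // Connect to one data node; "es101" is a placeholder hostname.
        TransportClient client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("es101", 9300));
        try {
            while (true) {
                ClusterHealthResponse health = client.admin().cluster()
                        .health(new ClusterHealthRequest()).actionGet();
                System.out.println("status=" + health.status()
                        + " activeShards=" + health.activeShards());
                // Keep polling every 10s while the cluster is red.
                if (health.status() != ClusterHealthStatus.RED) break;
                Thread.sleep(10000);
            }
        } finally {
            client.close();
        }
    }
}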
We have a tool for recording various OS metrics. With it we can see
that disk space utilization dropped sharply on two of the 4 nodes
during one of the master flip-flops. There appear to be three
separate rounds of nodes regaining connectivity and choosing a
master.
The ES machines run ES plus two Java applications that join the
cluster as non-data nodes (ESIndexer, and ESSearcherServer, which
performs searches).
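
For reference, those applications join the cluster roughly like this
(a minimal sketch against the 0.x NodeBuilder API; the cluster name
is a placeholder, and the real apps obviously do more):

import static org.elasticsearch.node.NodeBuilder.nodeBuilder;

import org.elasticsearch.client.Client;
import org.elasticsearch.node.Node;

public class ClientNodeExample {
    public static void main(String[] args) {
        // client(true) joins the cluster as a non-data node: it routes
        // requests but holds no shards itself.
        Node node = nodeBuilder()
                .clusterName("our-cluster")   // placeholder name
                .client(true)
                .node();                      // builds and starts the node
        Client client = node.client();
        // ... index / search through `client` ...
        node.close();
    }
}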
https://gist.github.com/842370 - 101.log
https://gist.github.com/842380 - 102.log
https://gist.github.com/842381 - 103.log
https://gist.github.com/842386 - 104.log
https://gist.github.com/842638 - elasticsearch.yml
https://gist.github.com/842662 - disk space utilization
Any recommendations for avoiding data loss in the future?
Thanks
David