Cluster partition resulted in loss of data

We have a 4-node cluster running ES 0.14.2. Last night our operations
team needed to reboot our core switches to address an issue they were
seeing (not affecting ES). The switches were rebooted 10 minutes
apart. After the second reboot we saw a cluster disconnect, and then 5
minutes later the disconnected node itself started sending node
disconnects. Once a node is disconnected, shouldn't the other nodes
ignore it?

More troubling is that two of the nodes ended up deleting nearly all
of their index contents. Ultimately we lost ~45% of the index
contents, which we have since been able to reapply. We're close to
being back operational, but want to understand why those nodes chose
to throw away index data.

Timeline (all times UTC). Our machine names are abbreviated here to
101-104

~02:05 core switch reboot
~02:15 2nd core switch reboot
02:16 node 101 is unable to ping node 104 and declares itself master
02:21:47 node 101 is unable to ping nodes 103 and 102 and removes them
from the cluster. At this point our disk utilization recording shows
nodes 102 and 103 dropping index data.
02:21:55 node 101 sees all 4 nodes again and adds them back to the cluster
~03:00 I got a call from our NOC and started looking. At this point
both of the isolated clusters were in a red state. I stopped ES on the
isolated node and then restarted it, hoping the rest of the cluster
would then have enough shards for ES to get out of the red state. For
an hour it tried to start the remaining 4 shards, but couldn't.
04:04 I shut the whole cluster down and restarted it with all 4 nodes.
The cluster came up yellow and eventually green, but we had lost a
large number of documents. Several large indexes were empty, and
several had only 1 or 2 shards' worth of data.

At this point I tried another restart, but the data was gone (I was
grasping at straws here). I kicked off our tools to repopulate the
missing indexes.

We have a tool that records various OS metrics. With it we can see
that disk space utilization dropped sharply on two of the 4 nodes
during one of the master flip-flops. There appear to be three separate
episodes of nodes regaining connectivity and electing a master.

The ES machines run ES plus two Java applications that join the
cluster as non-data nodes (ESIndexer, and ESSearcherServer, which
performs the searches).
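
For context, here is a minimal sketch of how such an application joins
the cluster as a non-data node using the 0.14-era Java API. This is not
our actual ESIndexer/ESSearcherServer code, just an illustration;
cluster settings are assumed to come from an elasticsearch.yml on the
classpath.

    import org.elasticsearch.client.Client;
    import org.elasticsearch.node.Node;
    import static org.elasticsearch.node.NodeBuilder.nodeBuilder;

    public class ClientNodeSketch {
        public static void main(String[] args) {
            // client(true) joins the cluster but never holds shard data;
            // cluster.name etc. are read from elasticsearch.yml on the classpath
            Node node = nodeBuilder().client(true).node();
            Client client = node.client();

            // ... issue index/search requests through 'client' here ...

            node.close();
        }
    }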
https://gist.github.com/842370 - 101.log
https://gist.github.com/842380 - 102.log
https://gist.github.com/842381 - 103.log
https://gist.github.com/842386 - 104.log
https://gist.github.com/842638 - elasticsearch.yml
https://gist.github.com/842662 - disk space utilization

Any recommendations for avoiding data loss in the future?

Thanks

David

Uploaded output of _status https://gist.github.com/842802
We have replicas = 1
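
For reference, that is just the following line in elasticsearch.yml
(the full config is in the gist above):

    # one replica per shard, i.e. two copies of every shard across the data nodes
    index.number_of_replicas: 1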

Heya,

Sadly, this can happen in 0.14 (under the mentioned scenario), and it was fixed in 0.15. There were two bugs that could cause data loss in such cases. Another user (using 0.15) got into a similar situation, and no data was lost :).

As a side note, I plan to work on reducing the chances of getting split brain, and recovering from it (with potentially losing some data while the split cluster/brain was going on, depending on the knobs one chooses to set).
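
To make the "knobs" part concrete: the kind of setting I have in mind is a quorum requirement on master-eligible nodes, along the lines of the sketch below. This is illustrative only; nothing like it ships in 0.14/0.15 (later releases exposed exactly this as discovery.zen.minimum_master_nodes).

    # illustrative sketch: with 4 master-eligible nodes, require a quorum of
    # (4 / 2) + 1 = 3 before a master can be elected, so a minority partition
    # cannot form its own cluster
    discovery.zen.minimum_master_nodes: 3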

Side side note, 0.15.1 will be released early next week. Just a heads up if you want to reduce the number of upgrade cycles.

-shay.banon

Good to know this has already been fixed. I just reviewed the ES
0.15.0 release notes. Which issue addressed this?

The closest I saw was #633, but in our case the node was disconnected
rather than killed. Is this another way to reproduce the scenario?

We will begin upgrading our development environment to 0.15.0 and
reviewing the list of breaking changes (only the facet one looks
applicable to us).

Thanks,

David

Heya, yea, 633 is the one. I fixed it because of the possible race condition, but the same thing can happen on node disconnection.

If you can test 0.15 and see whether it solves the problem (by simulating the scenario), that would be great, since if it doesn't and it's a different bug, we can work on solving it. Thanks!

-shay.banon