Here are the log files:
d101 - http://dl.dropbox.com/u/12095883/d101-dev-0.15.2.log
d102 - http://dl.dropbox.com/u/12095883/d102-dev-0.15.2.log
d103 - http://dl.dropbox.com/u/12095883/d103-dev-0.15.2.log
Before this, I ran a test where d101 was disconnected and reconnected
within the TCP timeout, which I had bumped to 180 seconds for testing.
Here are the times when things were going on, which can be correlated
with the log files.
00:05 - d101 disconnected from network
00:11 - d101 reconnected to network
00:15 - d101 rejoins the d102/d103 cluster, which was in yellow
state. This is when the data disappears. It appears to me that d102
recovers empty data from its local gateway and syncs that out to
It looks like a decent chunk of shards were lost; two of the
specific indexes that were hit appear to be index28 and cbsmw.
Let me know if you need any more details or log captures with more
On Mar 10, 5:57 pm, Paul ppea...@gmail.com wrote:
FYI, I first encountered the issue in 0.14.2 and discussed it in this
Upgraded to 0.15.2 and ran various disaster test cases. Hit data
loss with the one described below.
- Three node cluster (1,2, and 3)
- All indexes have replica = 1
- Dropped 1 from the network
- Waited for 2 and 3 to disconnect from 1
- 2 and 3 are in yellow state, with shards rebalancing
- After ~5 minutes (while the rest of the cluster was still yellow) brought 1
back onto the network
- A few minutes later, 1 connects back to 2 and 3
- BOOM! With this one we ended up losing ~15% of the documents
(Started with ~5.5 million docs and ended up with ~4.5 million)
I am getting logs together and will have them posted tonight.
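
For anyone repeating the steps above, here is a rough sketch of how the cluster status and total document count could be recorded before and after the rejoin, so the loss can be quantified. It is only an illustration, not something taken from the test setup in this thread: the base URL (localhost:9200) and the helper names are assumptions, and it relies on the `/_cluster/health` and `/_count` REST endpoints returning `status`, `number_of_nodes` and `count` fields.

```python
import json
from urllib.request import urlopen

# Assumption: some node of the cluster is reachable over HTTP at this address.
BASE_URL = "http://localhost:9200"


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def snapshot(base_url=BASE_URL):
    """Record cluster status, node count and total doc count at one point in time."""
    health = fetch_json(base_url + "/_cluster/health")
    count = fetch_json(base_url + "/_count")
    return {
        "status": health.get("status"),          # green / yellow / red
        "nodes": health.get("number_of_nodes"),
        "docs": count.get("count"),
    }


def loss_percent(docs_before, docs_after):
    """Percentage of documents missing after the test, relative to the start."""
    if docs_before <= 0:
        raise ValueError("docs_before must be positive")
    return 100.0 * (docs_before - docs_after) / docs_before
```

Calling `snapshot()` once before dropping node 1 and again after it rejoins gives two document counts to pass to `loss_percent`. With the rounded figures quoted above (~5.5 million before, ~4.5 million after) it returns about 18.2, in the same ballpark as the ~15% estimate.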