Hey,
We've been running 0.16.2 for a while with no problems (other than the
memory leak that is fixed in 0.16.5; we're in the process of moving
to that release).
However, a couple of nights ago we had an incident with our core router
that caused a network partition. We have four nodes, and it appears
that node 1 was disconnected from nodes 2, 3, and 4.
The 2,3,4 cluster went into a yellow state and began recovering data
from each other, while the single-node cluster (node 1) went red.
I am not sure whether any corruption occurred at this point.
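For reference, the yellow/red states above are what the cluster health
API reports (yellow meaning replica shards are unallocated, red meaning
primary shards are missing); checking it on a node looks something like
this, with the hostname and port as placeholders for our setup:

    # Check cluster health as seen from a given node.
    # yellow = unassigned replicas; red = missing primaries.
    curl -XGET 'http://node1:9200/_cluster/health?pretty=true'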
In order to rectify this, I took the following steps:
- Shut down the single-node cluster (node 1)
- Started the orphaned node back up, and it rejoined the main 2,3,4
cluster
- Replayed the transactions that had occurred while the cluster was
split
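(For completeness: a rejoin like this shows up in the cluster health
output, where number_of_nodes should return to 4; the hostname below is
again a placeholder:)

    # Confirm the restarted node rejoined: number_of_nodes should be
    # back to 4 and status should eventually return to green.
    curl -XGET 'http://node2:9200/_cluster/health?pretty=true'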
This should have restored the cluster to a correct state, but
something appears to have gone wrong when node 1 was disconnected or
rejoined the cluster. Three of the larger indexes ended up in an
inconsistent state: depending on which node was queried, different
counts would come back. These were the indexes affected:
idol-ft_20110513220131
idol-nab_20110513220132
idol-reports1_20110513220132
For example, here are the counts I get when I hit idol-
ft_20110513220131 from all 4 nodes:
1 - 1154320
2 - 1079486
3 - 1080016
4 - 1228060 - This is the correct count
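(These sorts of per-node counts can be gathered by hitting each node's
HTTP port directly with the count API, along these lines; the hostnames
are placeholders:)

    # Query each node directly; with consistent shards these should
    # all return the same count.
    for node in node1 node2 node3 node4; do
      curl -s -XGET "http://$node:9200/idol-ft_20110513220131/_count"
      echo
    done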
Log files, cluster state and config file can be viewed here:
To address this, I rebuilt the indexes from our backend storage and
swapped aliases to make the new indices live.
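Schematically, the alias swap was done with the aliases API and looked
something like this (the alias name and the rebuilt index name below
are illustrative, not our actual names):

    # Atomically repoint the live alias from the inconsistent index
    # to the rebuilt one.
    curl -XPOST 'http://node2:9200/_aliases' -d '{
      "actions": [
        { "remove": { "index": "idol-ft_20110513220131", "alias": "idol-ft" } },
        { "add":    { "index": "idol-ft_20110520120000", "alias": "idol-ft" } }
      ]
    }'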
I still have the inconsistent indices available if there are any
details you want from them.
I had done extensive testing around this scenario on 0.16.2 and never
reproduced this. It seems different from some of the index corruption
issues that occurred with 0.14.2 and were fixed in 0.16: the
destruction in 0.14 was much more severe, with indices getting
completely wiped, whereas here the indices are merely inconsistent and
all of the data is sometimes still available.
Please let me know what I can do to help on this. I'll lend whatever
support is necessary to help address it.
Thanks,
Paul