Index inconsistency on 0.16.2 after a network partition

Hey,
We've been running 0.16.2 for a while with no problems (other than the
memory leak that is fixed in 0.16.5; we're in the process of moving
to that release).

However, a couple of nights ago we had an incident with our core router
that caused a network partition. We have four nodes, and it appears
that node 1 was disconnected from nodes 2, 3, and 4.

The 2,3,4 cluster went into a yellow state and those nodes began
recovering data from each other, while the single-node cluster (1) went
into a red state. I am not sure whether any corruption occurred at this
point.
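(For anyone following along, each node's view of the cluster can be
checked with the cluster health API; node1/node2 below are placeholder
hostnames, not our real ones:)

    # Rough sketch: compare cluster health as seen from each side of the split
    curl 'http://node1:9200/_cluster/health?pretty=true'   # minority side
    curl 'http://node2:9200/_cluster/health?pretty=true'   # 2,3,4 side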

To rectify this, I took the following steps:

  • Shut down the single-node cluster (node 1)
  • Started the orphaned node back up; it rejoined the
    main 2,3,4 cluster
  • Replayed transactions that had occurred while the cluster was split

This should have restored the cluster to a correct state, but
something appears to have gone wrong when node 1 was disconnected or
when it rejoined the cluster. Three of the larger indexes ended up in an
inconsistent state: depending on which node was queried, different
counts came back. These were the indexes affected:

idol-ft_20110513220131
idol-nab_20110513220132
idol-reports1_20110513220132

For example, here are the counts I get when I hit
idol-ft_20110513220131 from all 4 nodes:
1 - 1154320
2 - 1079486
3 - 1080016
4 - 1228060 (this is the correct count)
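(For reference, a per-node count like the above can be taken by hitting
the count API on each node individually; node1/node4 are placeholder
hostnames:)

    # Same index, different node -- the reported totals differed per node
    curl 'http://node1:9200/idol-ft_20110513220131/_count'
    curl 'http://node4:9200/idol-ft_20110513220131/_count'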

Log files, cluster state, and the config file can be viewed here
(Dropbox link, now disabled):

To address this, I rebuilt the indexes from our backend storage and
swapped aliases to make the new indices live.
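(For reference, the alias swap can be done atomically with the indices
aliases API; the alias name and the rebuilt index name below are made up
for illustration:)

    # Sketch only: "idol-ft" (alias) and the rebuilt index name are hypothetical
    curl -XPOST 'http://localhost:9200/_aliases' -d '{
      "actions": [
        { "remove": { "index": "idol-ft_20110513220131", "alias": "idol-ft" } },
        { "add":    { "index": "idol-ft_20110915000000", "alias": "idol-ft" } }
      ]
    }'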

I still have the inconsistent indices available if there are any
details you want from them.

I had done extensive testing around this scenario on 0.16.2 and had
never reproduced this. It seems different from the index corruption
issues that occurred with 0.14.2 and were fixed in 0.16: the damage in
0.14 was much more severe and resulted in indices getting completely
wiped, whereas here the indices are merely inconsistent, with all of the
data still available on at least one node.

Please let me know what I can do to help with this. I'll lend whatever
support is necessary to help address it.

Thanks,
Paul

Hey,

Once you restarted node 1, it should have synced with the other nodes
and provided a consistent view of the data. There were some enhancements
to this process in 0.17, but it's strange that you ran into this.

Btw, in 0.17 you can use the discovery.zen.minimum_master_nodes setting,
which would have handled this case of network partitioning (for example,
if you have it set to 2).
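A minimal sketch of what that looks like in elasticsearch.yml, assuming
a 4 node cluster:

    # elasticsearch.yml (0.17+) -- sketch only
    # A partitioned side with fewer than 2 master-eligible nodes will not
    # elect a master and operate as its own cluster.
    discovery.zen.minimum_master_nodes: 2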

-shay.banon


Hey,
Thanks for the details. A couple of follow-up questions, if you have
a few more moments:

  • I am thinking about increasing the discovery.zen.ping_timeout to a
    pretty large number, something between 5 and 10 minutes. My
    understanding, and what I've seen from testing, leads me to believe
    this should cause any new operations to block until the timeout is
    reached or the network is restored. If a node is brought down
    cleanly, this shouldn't come into play. Is that correct, and are
    there any negatives I haven't considered?

  • I am not sure that discovery.zen.minimum_master_nodes would have
    made a difference in this case. I had a split of 1 vs 3 nodes, and
    the inconsistency started either when they were split or when they
    rejoined. It seems the split/rejoin would have occurred even with
    that setting.

  • Is there an easy way to tell which replica of a shard is giving
    which counts? I believe this may be possible with Luke, but it is
    not very easy to do. If I had known which shards had gotten into a
    bad state, I should have been able to restore them by stopping the
    node with the bad shard and wiping the data for that index, forcing
    recovery from the primary.

Thanks for all the great work that goes into ES and I appreciate all
the time spent answering questions like this.

Best Regards,
Paul


On Mon, Sep 12, 2011 at 11:18 PM, ppearcy ppearcy@gmail.com wrote:

Hey,
Thanks for the details. A couple of follow-up questions, if you have
a few more moments:

  • I am thinking about increasing the discovery.zen.ping_timeout to a
    pretty large number, something between 5 and 10 minutes. My
    understanding, and what I've seen from testing, leads me to believe
    this should cause any new operations to block until the timeout is
    reached or the network is restored. If a node is brought down
    cleanly, this shouldn't come into play. Is that correct, and are
    there any negatives I haven't considered?

You mean the fault detection timeout between nodes? Those settings are
documented under the zen discovery module (fault detection settings).
When there is a clean shutdown of a node, or an actual socket failure,
the disconnection will be detected extremely fast. It is when a request
is sent and no response comes back that those fd timeouts come into
play.

I do not recommend setting those to too high a value, since the
minimum_master_nodes setting will be affected by it (the nodes will not
be identified as disconnected).
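For reference, the relevant settings in elasticsearch.yml look roughly
like this (the values shown are the defaults, not a recommendation):

    # zen discovery fault detection (fd) -- sketch, values are the defaults
    discovery.zen.fd.ping_interval: 1s    # how often a node gets pinged
    discovery.zen.fd.ping_timeout: 30s    # how long to wait for a ping response
    discovery.zen.fd.ping_retries: 3      # failed pings before a node is considered disconnected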

  • I am not sure that discovery.zen.minimum_master_nodes would have
    made a difference in this case. I had a split of 1 vs 3 nodes, and
    the inconsistency started either when they were split or when they
    rejoined. It seems the split/rejoin would have occurred even with
    that setting.

It should have helped. What would have happened is that the one node
that was disconnected would have gotten into a state where it would not
have shards allocated on it, and it would go into a rejoin process
automatically. Your logic is sound, as in, the fact that you shut down
the node and reconnected it should have resolved any potential
conflicts, but I believe what you saw is solved in 0.17 (at least based
on my tests).

  • Is there an easy way to tell which replica of a shard is giving
    which counts? I believe this may be possible with Luke, but it is
    not very easy to do. If I had known which shards had gotten into a
    bad state, I should have been able to restore them by stopping the
    node with the bad shard and wiping the data for that index, forcing
    recovery from the primary.

Yes, when you execute a search and set the explain flag to true, it will
also return the shard and node each hit comes back from.
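Something like this, for example (the host is a placeholder, the index
name is from your report); each hit in the response then carries _shard
and _node fields:

    # Sketch: the explain flag adds _shard and _node to every hit
    curl -XGET 'http://localhost:9200/idol-ft_20110513220131/_search?pretty=true' -d '{
      "explain": true,
      "size": 5,
      "query": { "match_all": {} }
    }'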

