Is it possible to add a customized merging strategy to alleviate split-brain impact?

Jing_Liu · March 31, 2014, 6:11pm

Hi ES team,

When split-brain occurs, I found the following behavior in ES during the merge
between A and B (i.e., the groups of nodes with master A and master B).
Assume we don't know when the split-brain happened and both node groups have
updated their data to some extent:

- If A and B each hold data the other does not, all data is merged
  successfully.
- If A and B have the same record id but different record values (due to
  updates), ES cannot merge the data and the system hangs (the
  split-brain effect).

For the 2nd case, is it possible to add a customized merging strategy to
ES? Say, if two records have the same id but different values, we take
the record with the latest timestamp.
That way, I believe split-brain would have less impact. Can we
do that? Or will it be added to the ES roadmap?
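
A minimal sketch of the rule proposed above, written in Python for illustration only. This is not an Elasticsearch API: `Record`, `merge_latest_wins`, and the per-record `timestamp` are hypothetical names, and the sketch assumes the clocks on both sides of the split are comparable.

```python
from typing import Dict, NamedTuple


class Record(NamedTuple):
    value: str
    timestamp: float  # assumed: set by the writer, comparable across nodes


def merge_latest_wins(a: Dict[str, Record], b: Dict[str, Record]) -> Dict[str, Record]:
    """Merge the records of two node groups; on an id conflict keep the newer one."""
    merged = dict(a)
    for record_id, rec in b.items():
        existing = merged.get(record_id)
        # Ids present on only one side are copied over unchanged (case 1).
        # Ids present on both sides keep the later timestamp (case 2).
        if existing is None or rec.timestamp > existing.timestamp:
            merged[record_id] = rec
    return merged


side_a = {"1": Record("only on A", 100.0), "2": Record("A's update", 200.0)}
side_b = {"2": Record("B's update", 250.0), "3": Record("only on B", 120.0)}
merged = merge_latest_wins(side_a, side_b)
```

Here record "2" ends up with B's value because its timestamp is later; A's update to that record is silently dropped, which is the data-loss trade-off of this strategy.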

Thanks,
Jing

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/5a0515c4-d4dc-4062-a306-775317ec646f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jing_Liu · April 15, 2014, 8:45pm

Anyone, please?


brian_yoder · April 16, 2014, 1:43pm

Jing,

I don't have much experience with ES in a production cluster environment;
all my experience has been with the Java API for mapping, bulk load, and
query logic, and with huge databases and things like that. But my 3-node
test ES cluster has gathered some dust over the past few months as other
tasks have loomed (most good; it's just a matter of time and priority). So
your question really intrigued me.

*When split-brain occurs, I found following behaviors on ES during the
merge between A and B (i.e., a group of nodes with master A or B):*
Assume we don't know when the split-brain happens and both node groups
have updated their data to some extent:
- If A and B have exclusive data separately, all data will be merged
successfully
- If A and B have the same record id but different record value (due to
update), ES cannot merge the data and the system is hanging there (aka.
split-brain effect)

Are you saying that case 1 is handled automatically?

*For the 2nd case, is it possible to add a customized merging strategy in
ES? Say, if having the same record id but different record value, we take
the record with the latest timestamp.*
By this means, I believe we will have less impact from split-brain. Can
we do that? Or will it be added to ES roadmap.

I would add a second up-vote to this request.

In the Oracle world of replication, consider two updates, each to the same
record but in a separate node in a replicated cluster. If one update
modifies field A and the other modifies field B, then the most recent
update wins and the previous one's changes are lost. In other words, the
end result of cross-node replication is that either field B's updates are
saved or field A's updates are saved, but not both. Our solution was to
direct all clients to point to one of the Oracle nodes and let replication
flow in only one direction; fail-over means those applications would need
to be re-pointed. Oracle did nothing to help us; it was all up to us.
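
To make the field A / field B loss concrete, here is a small Python sketch of whole-record "most recent update wins" resolution. It is only an illustration of the behavior described above, not Oracle's implementation; `last_update_wins` and the field names are made up for the example.

```python
from typing import Any, Dict, Tuple

Row = Dict[str, Any]


def last_update_wins(first: Tuple[float, Row], second: Tuple[float, Row]) -> Row:
    """Whole-record resolution: return the row carrying the later timestamp."""
    return max(first, second, key=lambda pair: pair[0])[1]


original = {"field_a": "a0", "field_b": "b0"}
# Node 1 changes field_a at t=10; node 2 changes field_b at t=20.
update_on_node_1 = (10.0, {**original, "field_a": "a1"})
update_on_node_2 = (20.0, {**original, "field_b": "b1"})

resolved = last_update_wins(update_on_node_1, update_on_node_2)
```

The resolved row keeps node 2's change to `field_b` but still has the old `field_a`, so node 1's update is gone even though the two updates never touched the same field.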

So your suggestion in the 2nd case makes a lot of sense. No, it's not
perfect. Yes, there can be data loss. Oracle buys palatial headquarters
buildings (http://media3.s-nbcnews.com/j/MSNBC/Components/Photo/2009/April/090416/090420-sun-oracle-hmed-4p.grid-6x2.jpg),
racing yachts (http://yachtingworld.media.ipcdigital.co.uk/9097/000007e54/d554/AC34SFJune15-0900.jpg),
and very nice private jets (http://www.oracleprivatejets.com/images/opjsceptercard.jpg)
with their data loss replication, so their replication strategy can't be
all bad! As with the recent additions to the version types in ES 1.1,
which came with the appropriate warnings, the 2nd case as you describe could be
implemented along with its own warnings about exposure to data loss: an
exposure that a user could work around as needed, but with their eyes open.
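
As a rough illustration of how version types relate to the "latest timestamp wins" idea, the Python sketch below models the acceptance rule of externally versioned writes as I understand it from the Elasticsearch documentation: with `version_type=external` a write is applied only if its version is strictly greater than the stored one, and with `external_gte` if it is greater or equal. The `index_external` function and the in-memory `store` are hypothetical; this is not client code and it does nothing to reconcile a split-brain after the fact.

```python
from typing import Dict, Tuple

Store = Dict[str, Tuple[int, str]]  # doc id -> (version, source)


def index_external(store: Store, doc_id: str, version: int, source: str,
                   allow_equal: bool = False) -> bool:
    """Apply a write only if its caller-supplied version is new enough.

    allow_equal=False models version_type=external (strictly greater);
    allow_equal=True models version_type=external_gte (greater or equal).
    Returns True if the write was applied, False on a version conflict.
    """
    current = store.get(doc_id)
    if current is not None:
        stored_version = current[0]
        too_old = version < stored_version if allow_equal else version <= stored_version
        if too_old:
            return False
    store[doc_id] = (version, source)
    return True


store: Store = {}
# Arbitrary epoch-millisecond timestamps used as the external version.
first = index_external(store, "42", 1397655794000, "first write")
stale = index_external(store, "42", 1397600000000, "older write arriving late")
newer = index_external(store, "42", 1397672094000, "later write")
```

With a timestamp as the version, the older write that arrives late is rejected and the later write replaces the first, which is the same exposure to data loss discussed above.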

Brian


Jing_Liu · April 16, 2014, 6:14pm

Hi Brian,

Thanks for your input.
Yes, we found the above two cases during our tests. Case 1 is handled
automatically. Hopefully case 2 can get the ES team's attention.

Jing


Ivan · April 16, 2014, 11:30pm

I believe that the Elasticsearch team is more focused on eliminating
split-brain than on the aftereffects of a split-brain. Recent comments
indicate that they are actively working on a solution.

The new consensus algorithm (Paxos/Raft?) will undoubtedly affect how
conflicts are reconciled.

Cheers,

Ivan


Jing_Liu · April 16, 2014, 11:41pm

Thanks Ivan for your response.
Is it possible to know when the new solution will come out? ES 1.2?

Thanks,
Jing


Ivan · April 16, 2014, 11:55pm

I have no idea, but here is a recent comment:

github.com/elastic/elasticsearch

minimum_master_nodes does not prevent split-brain if splits are intersecting

Opened 08:15AM - 17 Dec 12 UTC by saj; closed 02:58PM - 01 Sep 14 UTC
Labels: >bug, v2.0.0-beta1, v1.4.0.Beta1

G'day,

I'm using ElasticSearch 0.19.11 with the unicast Zen discovery protocol.…

With this setup, I can easily split a 3-node cluster into two 'hemispheres' (continuing with the brain metaphor) with one node acting as a participant in both hemispheres. I believe this to be a significant problem, because now `minimum_master_nodes` is incapable of preventing certain split-brain scenarios.

Here's what my 3-node test cluster looked like before I broke it: https://saj.beta.anchortrove.com/es-splitbrain-1.png

Here's what the cluster looked like after simulating a communications failure between nodes (2) and (3): https://saj.beta.anchortrove.com/es-splitbrain-2.png

Here's what seems to have happened immediately after the split:

1. Node (2) and (3) lose contact with one another. (`zen-disco-node_failed` ... `reason failed to ping`)
2. Node (2), still master of the left hemisphere, notes the disappearance of node (3) and broadcasts an advisory message to all of its followers. Node (1) takes note of the advisory.
3. Node (3) has now lost contact with its old master and decides to hold an election. It declares itself winner of the election. On declaring itself, it assumes master role of the right hemisphere, then broadcasts an advisory message to all of its followers. Node (1) takes note of this advisory, too.

At this point, I can't say I know what to expect to find on node (1). If I query both masters for a list of nodes, I see node (1) in both clusters.

Let's look at `minimum_master_nodes` as it applies to this test cluster. Assume I had set `minimum_master_nodes` to 2. Had node (3) been completely isolated from nodes (1) and (2), I would not have run into this problem. The left hemisphere would have enough nodes to satisfy the constraint; the right hemisphere would not. This would continue to work for larger clusters (with an appropriately larger value for `minimum_master_nodes`).

The problem with `minimum_master_nodes` is that it does not work when the split brains are intersecting, as in my example above. Even on a larger cluster of, say, 7 nodes with `minimum_master_nodes` set to 4, all that needs to happen is for the 'right' two nodes to lose contact with one another (a master election has to take place) for the cluster to split.

Is there anything that can be done to detect the intersecting split on node (1)? Would #1057 help? Am I missing something obvious? :)
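
The intersecting split in that issue can be reproduced with a toy model. The Python sketch below is a deliberately simplified assumption about the check (a node may act as master if it can see at least `minimum_master_nodes` master-eligible nodes, itself included); `can_be_master` and the `reachable` map are hypothetical names, and this is not how Zen discovery is actually implemented.

```python
from typing import Dict, Set


def can_be_master(node: int, reachable: Dict[int, Set[int]],
                  minimum_master_nodes: int) -> bool:
    """True if the node sees enough master-eligible nodes, itself included."""
    return len(reachable[node] | {node}) >= minimum_master_nodes


# Three master-eligible nodes; the link between (2) and (3) is down,
# but node (1) can still talk to both of them.
reachable = {1: {2, 3}, 2: {1}, 3: {1}}
minimum_master_nodes = 3 // 2 + 1  # majority of 3 nodes, i.e. 2

masters = [n for n in (2, 3) if can_be_master(n, reachable, minimum_master_nodes)]

# Fully isolating node (3) instead leaves it below the threshold.
isolated = {1: {2}, 2: {1}, 3: set()}
masters_when_isolated = [
    n for n in (2, 3) if can_be_master(n, isolated, minimum_master_nodes)
]
```

In the intersecting case both node (2) and node (3) pass the check because each counts node (1), so two masters coexist; in the fully isolated case only node (2) passes.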

--
Ivan
