Fault Tolerance Fallacy

Ken_Edwards · January 23, 2014, 11:49pm

A few months ago, when my team was deciding which search technology to use,
one of the primary reasons we chose Elasticsearch over Solr was that
it treats distributed search as a top-level priority. Since then I have been
delighted by the rich feature set and powerful search capabilities it
provides, far beyond my original expectations. Unfortunately, I now believe
it is deficient in the area which appealed most to me -- providing a fault
tolerant distributed search solution.

Specifically, there is no attractive solution for reliably avoiding the
split brain ("mix brain") problem that others have described much better
than I can [1] https://github.com/elasticsearch/elasticsearch/issues/2488.
In summary, even with minimum_master_nodes appropriately set to a quorum of
the entire cluster, a split brain scenario can occur. This is not just a
theoretical problem: there are reports of this happening in a multitude of
production deployments. This has been a known issue for years
[2] https://github.com/elasticsearch/elasticsearch/issues/2117,
and while it was once partially addressed
[3] https://github.com/elasticsearch/elasticsearch/issues/2117, it
is still a source of concern and pain for many:
[4] https://groups.google.com/forum/#!searchin/elasticsearch/zookeeper|sort:date/elasticsearch/2aS57YkYObk/7BZlqYOV-QMJ
[5] https://groups.google.com/forum/#!searchin/elasticsearch/zookeeper|sort:date/elasticsearch/VijFeGDYA_E/d9FshM9ax1UJ
[6] https://groups.google.com/forum/#!searchin/elasticsearch/zookeeper|sort:date/elasticsearch/HfbdBvi91dA/ZuXK8MlquG0J
[7] https://groups.google.com/forum/#!searchin/elasticsearch/zookeeper|sort:date/elasticsearch/EMsZcgoSRbw/aPKH0DT5hn0J
[8] https://groups.google.com/forum/#!searchin/elasticsearch/zookeeper|sort:date/elasticsearch/LJyY0z5TFww/6Smcb5dwXZgJ
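
For readers unfamiliar with the term, a quorum here means a strict majority of the master-eligible nodes. The following is only a sketch of that arithmetic; the function name `quorum` is made up for illustration and is not part of Elasticsearch.

```python
def quorum(master_eligible_nodes: int) -> int:
    """Smallest node count that forms a strict majority of the cluster."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


# Candidate values for discovery.zen.minimum_master_nodes at a few cluster sizes.
for nodes in (3, 4, 5):
    print(nodes, quorum(nodes))
```

Note that a majority only rules out two disjoint groups each electing a master; it says nothing about partial partitions where one node can still see both sides, which is the case described in [1].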

The Zookeeper plugin [9] https://github.com/sonian/elasticsearch-zookeeper
is the most recommended workaround; however, this is an inadequate solution.
First, the requirement to maintain Zookeeper is a deterrent for some, and
it is directly in opposition to the desire to provide a reasonable default
solution with minimal configuration. Worse, this plugin appears to no
longer be maintained and has been broken for at least the past four months,
since the release of 0.90.4
[10] https://github.com/elasticsearch/elasticsearch/issues/4211
[11] https://github.com/sonian/elasticsearch-zookeeper/issues/19. Igor
has proposed what appears to be only a partial solution
[12] https://github.com/imotov/elasticsearch-zookeeper/commit/c79ec1415b43e82bc5939659e02bbdf9e479a53f,
and as far as I can tell it has not received much traction. This forces
developers to choose between risking a serious data corruption issue and
running an Elasticsearch cluster that cannot take advantage of ongoing
improvements.

Given the severity and widespread nature of this problem, I am surprised by
how long it has been allowed to persist in light of Elasticsearch's stated
goals [13] http://www.elasticsearch.org/overview/. System stability is
paramount to Elasticsearch's success, and I hope for a viable solution to
address the aforementioned problems. Are there any plans to do so?

Kind Regards,
Ken Edwards

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CABPQFqfzbTP%2B5n7JocDqTzw5fv_Km0DOoWqLUgyLPvTmLpq90g%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.

Holger_Hoffstatte · January 24, 2014, 12:14pm

On Thu, 23 Jan 2014 15:49:05 -0800, Ken Edwards wrote:

[..snip "cluster" problems/split brain resilience etc..]

Given the severity and widespread nature of this problem, I am surprised by
how long it has been allowed to persist in light of Elasticsearch's stated
goals [13] http://www.elasticsearch.org/overview/. System stability is
paramount to Elasticsearch's success, and I hope for a viable solution to
address the aforementioned problems. Are there any plans to do so?

Please do not take this as an "official" statement, but I do have it
on good authority that the importance of these problems is well understood,
and that the problem domain will very soon get all the attention it
deserves. And then some.

-h


Ivan · January 25, 2014, 6:26pm

Unfortunately the problem will still not be addressed in the upcoming 1.0
release. Judging by recent comments from the elasticsearch team, they are
truly looking into the matter. The change will require changing the
underlying consensus algorithm to something like Paxos/Raft.

That said, the current solution would be to avoid situations that could
lead to split brain scenarios in the first place. (duh! :)) The common
reasons for a node leaving a cluster are: network issues, long GC pauses,
and OOM errors. There is not much that can be done in elasticsearch for
network issues except for increasing the various timeout values. Each new
version of elasticsearch improves the memory situation. Recent versions
should also be compatible with G1 GC. Fine-tuning your queries and JVM
settings helps as well. The current consensus algorithm is frustrating
since it is apparent that the leader election is broken, although I have
not looked at it in 1.0.

Cheers,

Ivan


jprante · January 25, 2014, 7:03pm

I don't think the challenges of network disruptions are the same as those
of node failures. Network disruptions can result in different and even
contradictory cluster views while all nodes stay intact, and the challenge
is that the nodes do not all agree on one view. A quorum-based consensus or
timeouts often do not suffice in such situations.
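
To make the "contradictory views" point concrete, here is a deliberately simplified model of a partial partition in a three-node cluster. It is a sketch under stated assumptions, not the actual Zen discovery code: the node names, the `reachable` map, and the rule "stay or become master whenever a quorum of nodes is visible" are all illustrative.

```python
from typing import Dict, Set

# Which peers each node can reach. The A<->B link is down, but C still
# sees both sides, so no node is isolated from a majority.
reachable: Dict[str, Set[str]] = {
    "A": {"C"},
    "B": {"C"},
    "C": {"A", "B"},
}


def masters_after_partition(
    reachable: Dict[str, Set[str]],
    current_master: str,
    minimum_master_nodes: int,
) -> Set[str]:
    """Return every node that considers itself master under the toy rule."""
    masters = set()
    for node, peers in reachable.items():
        visible = len(peers) + 1  # the node's peers plus the node itself
        if visible < minimum_master_nodes:
            continue  # too few nodes in view to be, or stay, master
        if node == current_master:
            masters.add(node)  # the old master still sees a quorum and stays
        elif current_master not in peers:
            masters.add(node)  # lost the master, sees a quorum, elects itself
    return masters


print(sorted(masters_after_partition(reachable, "A", 2)))
```

In this toy model both A and B end up as master even though the quorum of 2 is respected, because each of them counts C in its own view.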

In a leader election consensus protocol like RAFT, an effective resistance
against multi leader election can be developed. Zen discovery leader
election seems very close to RAFT leader election but still needs some work.
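
As a rough illustration of how a Raft-style election resists electing several leaders, the sketch below models only the voting rule: each node grants at most one vote per term, and a candidate needs a majority of the whole cluster. The class and function names are hypothetical, and real Raft also compares logs before granting a vote, which is omitted here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Voter:
    current_term: int = 0
    voted_for: Optional[str] = None

    def request_vote(self, candidate: str, term: int) -> bool:
        if term < self.current_term:
            return False  # request from a stale term
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None  # new term, no vote cast yet
        if self.voted_for in (None, candidate):
            self.voted_for = candidate
            return True
        return False  # already voted for someone else in this term


def wins(candidate: str, term: int, voters: List[Voter], cluster_size: int) -> bool:
    """A candidate wins only with votes from a majority of the full cluster."""
    votes = sum(v.request_vote(candidate, term) for v in voters)
    return votes >= cluster_size // 2 + 1


a, b, c = Voter(), Voter(), Voter()
# A and B cannot reach each other, but both can reach C.
print(wins("A", 1, [a, c], 3))
print(wins("B", 1, [b, c], 3))
```

Because C has already voted for A in term 1, B cannot collect a majority in that term. B could retry in a later term; the guarantee is at most one leader per term, and an older leader steps down once it learns of a newer term.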

A presentation of RAFT (including leader election) can be found at
https://speakerdeck.com/benbjohnson/raft-the-understandable-distributed-consensus-protocol

Jörg


Ivan · January 27, 2014, 7:30am

Jörg, I am not sure if you were responding to Ken's post or my response,
but I do agree with you. Network disruptions are hard to plan for and there
is little tuning that can be done. The current leader election process
could be improved, however.

The elasticsearch team did say they are looking into other consensus
algorithms such as Paxos and Raft, but have not made a decision. I read the
original Paxos paper (a long time ago) and recently the Raft paper [1], and
the main selling point of Raft is not that it works better, just that it is
easier to understand. I would think that such a change would be in the
1.0 release, but I am glad that it is being planned.

[1]
https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf

Ivan


otisg · January 27, 2014, 7:15pm

FYI, from Shay: https://twitter.com/otisg/status/427866316444553216

Otis

Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


Ivan · February 5, 2014, 7:58pm

Otis, you can listen to the comment here at 1:04:20:
http://player.vimeo.com/video/85255909


otisg · February 5, 2014, 8:09pm

Thanks Ivan, but the link doesn't work for me. If this is from the ES NYC
meetup the other day, where Shay said they'll implement Raft or something
similar, I was there and heard that.

Otis

Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

Ivan · February 5, 2014, 8:17pm

Different link:
