Partial index replication causes data loss?

Hi Mailing List! I'm a first-time poster and a long-time reader.

We recently had a crash in our ES (1.3.1 on Ubuntu) cluster which caused us
to lose a significant volume of data. I have a "theory" about what happened,
and I would love to hear your opinions on it, and any suggestions to
mitigate it.

Here is a simplified play-by-play:

  1. Cluster has 3 data nodes, A, B, and C. The index has 10 shards. The
    index has a replica count of 1, so A is the master and B is a replica. C
    is doing nothing. Re-allocation of indexes/shards is enabled.
  2. A crashes. B takes over as master, and then starts transferring data
    to C as a new replica.
  3. B crashes. C is now master with an incomplete dataset.
  4. There is a write to the index.
  5. A and B finally reboot, and they are told that they are now stale (as
    C had a write while they were away). Both A and B delete their local data.
    A is chosen to be the new replica and re-syncs from C.
  6. ... all the data A and B had which C never got is lost forever.

Is the above scenario possible? If it is, wouldn't a better default
behavior for ES be to not reallocate in this scenario? That would have
caused the write in step #4 to fail, but in our use case, that is
preferable to data loss.
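A concrete sketch of the mitigation I have in mind (assumptions: an ES
node reachable on localhost:9200, and the dynamic
cluster.routing.allocation.enable setting the 1.x line exposes):

import requests

ES = "http://localhost:9200"

# "none" stops all shard allocation, so surviving nodes will not start
# rebuilding replicas from a partial copy; "all" restores the default.
resp = requests.put(
    ES + "/_cluster/settings",
    json={"transient": {"cluster.routing.allocation.enable": "none"}},
)
resp.raise_for_status()
print(resp.json())

With allocation set to "none", the cluster would stay degraded rather
than rebuilding everything from C's partial copy, and an operator could
decide which copy to recover from.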


Bump? I would love to hear some thoughts on this flow, and any
suggestions on how to mitigate it (other than replicating all data to
all nodes).

Thanks!


Very interesting. The default 'write consistency level' with Elasticsearch
is QUORUM, i.e. verify that a quorum of copies of a shard are available
before processing a write to it. In this case you were just left with 1
copy, C, and a write happened. So you would think that it should not go
through, since 2 copies would be required for quorum. However, see "write
consistency levels -- quorum of two is two" (elastic/elasticsearch#6482).
I think this goes to show this is a real, not a hypothetical, problem!

But guess what? Even if this were fixed, and a write to C never happened,
it is still possible that once A & B were back, C could be picked as
primary and clobber data. See "[Indexing] A network partition can cause
in flight documents to be lost" (elastic/elasticsearch#7572):
https://github.com/elasticsearch/elasticsearch/issues/7572#issuecomment-59983759


Interesting!

However, the write may not be the cause of the data loss here. Even if
there had been no write while A and B were down, would the recovery process
have happened the same way? In some further tests, it still looks like C
would have overwritten all the data on A and B when they rebooted.

This type of error is easily triggered by garbage collection on large data
sets making a server unresponsive for too long (perhaps the cluster kicks
out the unresponsive node, or a supervisor restarts the application).
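A minimal watchdog sketch that would at least surface that condition
early (assumptions: an ES node on localhost:9200 and a 3-node cluster):

import time
import requests

ES = "http://localhost:9200"
EXPECTED_NODES = 3

# Poll cluster health so a node dropped after a long GC pause is noticed
# before the cluster finishes rebuilding replicas from a partial copy.
while True:
    health = requests.get(ES + "/_cluster/health").json()
    if health["number_of_nodes"] < EXPECTED_NODES or health["status"] != "green":
        print("degraded:", health["status"], health["number_of_nodes"], "nodes")
    time.sleep(10)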


Yes, this is the 2nd issue I mentioned: ES will pick basically any
replica as primary, without considering which one might be more
'up-to-date'.


Ahh, thanks for pointing that out!

Let's move this conversation to the GitHub issue, as I think we can be more
productive there.


If you have replica level 1 with 3 nodes, that is not enough; you must set
replica level 2. With replica level 1 and an outage of 2 nodes, as you
describe, you will lose data.
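For example (a sketch, assuming an ES node on localhost:9200 and an
index hypothetically named "myindex"), the replica count can be raised
on a live index:

import requests

ES = "http://localhost:9200"

resp = requests.put(
    ES + "/myindex/_settings",
    json={"index": {"number_of_replicas": 2}},
)
resp.raise_for_status()
# With 3 data nodes and 2 replicas, every node holds a full copy of
# every shard, so no two-node outage can leave only a partial dataset.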

Jörg


Interesting, @Jörg.
How many nodes would you then need to avoid replicating all data to all
nodes? A highly-touted feature of ES is the ability to shard and spread
data across nodes. Any recommendations?

--
Evan Tahler | evantahler@gmail.com | 412.897.6361
evantahler.com | actionherojs.com


Hi Evan,

As Jörg said (though I wouldn't make replica count == node count a golden
rule), having 2 copies of your data means you are resilient to one failure
at a time. If another failure occurs while you are still recovering from
the first, bad things may happen. That said, I'm not sure losing data is
explainable by what you described.

When you have 10 shards, each with 1 copy, you have 20 shards in total to
spread around the cluster, so node C should have had some shards assigned
to it. When A crashed, ES started to compensate for the lost extra copies
by replicating shards from B to C (and maybe from C to B as well).
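A tiny worked version of that arithmetic:

# Total shard copies to place = primaries * (1 + replicas).
primaries, replicas = 10, 1
print(primaries * (1 + replicas))  # 20 copies spread over 3 nodes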

When ES starts to copy shards from one node to another, the shards on the
target node (C in this case) are marked as initializing. Only once all the
data is copied are they marked as started and able to accept new writes.
What should have happened here is that C becomes master but the index (and
the cluster) goes RED, because there is no active shard in one of the shard
groups. At that point no writes are possible to that shard group.
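A quick way to see those states (a sketch, assuming an ES node on
localhost:9200):

import requests

ES = "http://localhost:9200"

# Overall status: green / yellow / red ("red" = a shard group with no
# active copy, at which point writes to it are impossible).
health = requests.get(ES + "/_cluster/health").json()
print("cluster status:", health["status"])

# Per-shard view: index, shard number, primary/replica, STARTED vs
# INITIALIZING, and the node each copy lives on.
print(requests.get(ES + "/_cat/shards", params={"v": "true"}).text)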

Obviously this is not what happened to you. Do you have any information
recorded from the problematic time? Logs, cluster state, Marvel data, etc.?

Cheers,
Boaz
