ES0.15.2 - network partition can lead to data loss

ppearcy · March 11, 2011, 12:57am

FYI, I first encountered this issue in 0.14.2 and discussed it in this
thread:
http://elasticsearch-users.115913.n3.nabble.com/Cluster-partition-resulted-in-loss-of-data-td2569870.html#none

I upgraded to 0.15.2 and ran various disaster test cases. The one
described below resulted in data loss.

Three node cluster (1,2, and 3)
All indexes have replica = 1
Dropped 1 from the network
Waited for 2 and 3 to disconnect from 1
2 and 3 are in yellow state, with shards rebalancing
After ~5 minutes (while other cluster was still yellow) brought 1
back onto the network
A few more minutes later and 1 connects back to 2 and 3
BOOM! With this one we ended up losing ~15% of the documents
(Started with ~5.5 million docs and ended up with ~4.5 million)
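
One way to put a number on the loss in a test like the one above is to record the cluster status and total document count before the partition and again after the node rejoins. The following is only a minimal sketch, not what was used in this test: it assumes the `_cluster/health` and `_count` REST endpoints are reachable on a node and return `status` and `count` fields, and the helper names (`fetch_json`, `snapshot`, `loss_fraction`) and the base URL are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON response body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def snapshot(base_url):
    """Record cluster status and total doc count (hypothetical helper).

    Assumes /_cluster/health returns a "status" field and /_count
    returns a "count" field on the version being tested.
    """
    health = fetch_json(base_url + "/_cluster/health")
    count = fetch_json(base_url + "/_count")
    return {"status": health["status"], "docs": count["count"]}


def loss_fraction(docs_before, docs_after):
    """Fraction of documents missing after the test (0.0 means no loss)."""
    if docs_before <= 0:
        raise ValueError("docs_before must be positive")
    # Clamp at zero so documents indexed during the test do not
    # show up as a negative loss.
    return max(docs_before - docs_after, 0) / docs_before
```

Calling `snapshot("http://localhost:9200")` before dropping the node and again after it rejoins, then passing the two `docs` values to `loss_fraction`, gives the loss as a fraction. The address is an assumption; any node that stays reachable would do.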

I am getting logs together and will have them posted tonight.

Thanks,
Paul

ppearcy · March 11, 2011, 7:30am

Here are the log files:
d101 - http://dl.dropbox.com/u/12095883/d101-dev-0.15.2.log
d102 - http://dl.dropbox.com/u/12095883/d102-dev-0.15.2.log
d103 - http://dl.dropbox.com/u/12095883/d103-dev-0.15.2.log

Before this, I ran a test where d101 was disconnected and reconnected
within the TCP timeout, which I had bumped to 180 seconds for testing.

Here are the times when things were going on, which can be correlated
to the log files.
00:05 - d101 disconnected from network
00:11 - d101 reconnected to network
00:15 - d101 joins back to d102/d103 cluster which was in yellow
state. This is when the data disappears. It appears to me that d102
recovers empty data from its local gateway and syncs that out to
d101.

Looks like a decent chunk of shards were lost, but a couple of the
specific indexes that were hit appear to be index28 and cbsmw.

Let me know if you need any more details or log captures with more
details.

Thanks,
Paul

kimchy · March 11, 2011, 4:03pm

Hi,

Thanks for the info, this is certainly something I would like to fix. You said you saw the d101 node being detected as disconnected from the cluster (resulting in the rebalancing of shards), but I don't see it in the logs. Maybe it's in the previous day's logs?

-shay.banon

ppearcy · March 11, 2011, 4:10pm

These are the correct logs. I can bump up certain logging if needed.

Thanks

ppearcy · March 11, 2011, 9:35pm

Running more tests where I have every shard on every node to see if I
can get the same behavior. So far, I haven't been able to reproduce.

I wanted to note that I do not get any cluster disconnect indications
in the logs with default logging on 0.15.2. All I see is a timeout
exception for an index operation.

Please let me know any further details needed to debug and I would be
happy to get any logs needed, just let me know what levels you want me
to bump up.

Thanks,
Paul

kimchy · March 12, 2011, 8:04am

Got you. So, to validate the scenario: you disconnect d101 from the network, keep indexing docs, and later reconnect it. The node is not considered disconnected from the cluster (fault detection has not flagged it, since it did not exceed its timeout limit), but shards move around because the index operations failed (they timed out).

If that's the case, I can also try to recreate it locally.

ppearcy · March 14, 2011, 5:43pm

Great, thanks. I am a little surprised that a failed index operation
would kick in a cluster rebalance. I figured the index operation would
just fail and the cluster wouldn't start rebalancing until the TCP
timeout kicked in.

I'd be very happy to give any potential fixes a spin on my side to see
if I can still reproduce.

Thanks!
Paul

ppearcy · March 14, 2011, 10:49pm

Ah, cool stuff.

Curious whether this same logic can be applied when the mapping for a
field is missing in an index? Right now, if you want a search to span
two indexes, it will fail if the sort field isn't defined in both. It'd
be convenient to optionally treat the index missing the sort field
mapping as if the value were null.

Thanks

kimchy · March 14, 2011, 11:16pm

I think you posted this on the wrong thread :). Maybe it relates to handling missing values when sorting? If so, it might be possible to handle cases when the field is not defined (though a bit tricky). Open an issue for it so we won't lose track.

ppearcy · March 15, 2011, 5:08pm

Doh! My bad. Stomping on my own thread

Yeah, will open a feature request. Thanks!
