More split brain + recovery issues

Hi,
Two days ago we had a split-brain situation happen again; this time it
happened while the cluster was in normal operation, not during any
configuration changes like last time.
I don't know the root cause of the issue, but at some point our 7-node
cluster split into two clusters, one with 5 nodes and one with 2 nodes.
The original configuration is 5 nodes in our primary location and 2 nodes
in a secondary location, but the split was not between locations: one of
the servers in the secondary location was part of the 5-node cluster,
while the other secondary server formed the 2-node cluster together with
a node from the primary location.
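
For reference, the setting that is supposed to prevent exactly this kind of
split is discovery.zen.minimum_master_nodes in elasticsearch.yml. This is
only a sketch of what we understand it should look like for our 7-node
cluster, assuming all 7 nodes are master eligible:

    # elasticsearch.yml, on every master-eligible node
    # with 7 master-eligible nodes a majority is (7 / 2) + 1 = 4
    discovery.zen.minimum_master_nodes: 4

With that value the 2-node side of a split should never be able to elect
its own master, although it also means the secondary location cannot
operate on its own if it loses contact with the primary.
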
The only way to get those two servers to join the main cluster was to
restart them, but then shard allocation issues started to show up. After
the restart the cluster got stuck with 7 shards relocating, 4 shards
initializing and 10 shards unassigned.
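
(For completeness, this is roughly how we were watching those counts; any
node should do, es1-01 and the default HTTP port here are just an example:)

    curl -s 'http://es1-01:9200/_cluster/health?pretty=true'
    # relevant fields in the response:
    #   "relocating_shards" : 7,
    #   "initializing_shards" : 4,
    #   "unassigned_shards" : 10
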
We picked one server (es1-01) more or less at random and restarted it.
This didn't help much, but after the restart (and after increasing the
logging level) the following kind of errors showed up in its log file:

[2012-05-27 11:33:46,841][DEBUG][action.admin.indices.status] [es1-01] [6110811_xs][0], node[ub7Ak7ckQ_imuk-v0uFQ8g], relocating [b-CbP5mVRhubE-k600aotw], [P], s[RELOCATING]: Failed to execute [org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@76c3ba66]
org.elasticsearch.transport.RemoteTransportException: [es1-04][inet[/10.1.101.154:9300]][indices/status/s]
Caused by: org.elasticsearch.index.IndexShardMissingException: [6110811_xs][0] missing
    at org.elasticsearch.index.service.InternalIndexService.shardSafe(InternalIndexService.java:175)
    at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:153)
    at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:59)
    at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:398)
    at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:384)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:374)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
[2012-05-27 11:33:46,843][DEBUG][action.admin.indices.status] [es1-01] [6151631_s][1], node[ub7Ak7ckQ_imuk-v0uFQ8g], relocating [b-CbP5mVRhubE-k600aotw], [P], s[RELOCATING]: Failed to execute [org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@76c3ba66]
org.elasticsearch.transport.RemoteTransportException: [es1-04][inet[/10.1.101.154:9300]][indices/status/s]
Caused by: org.elasticsearch.index.IndexShardMissingException: [6151631_s][1] missing
    at org.elasticsearch.index.service.InternalIndexService.shardSafe(InternalIndexService.java:175)
    at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:153)
    at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:59)
    at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:398)
    at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:384)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:374)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

So we restarted es1-04, which indeed caused the unallocated and relocating
shard counts to go down to 0, and this showed up in es1-01's log:

[2012-05-27 12:34:01,940][INFO ][cluster.service ] [es1-01] removed {[es1-04][ub7Ak7ckQ_imuk-v0uFQ8g][inet[/10.1.101.154:9300]]{data_center=ny4},}, reason: zen-disco-receive(from master [[es1-02][YHfuNBZNTIOz-XS2IDaH8A][inet[/10.1.101.152:9300]]{data_center=ny4}])
[2012-05-27 12:34:07,287][DEBUG][index.gateway ] [es1-01] [6136561_xs][0] starting recovery from local ...
[2012-05-27 12:34:09,405][DEBUG][index.gateway ] [es1-01] [6136561_xs][0] recovery completed from local, took [2.1s]
    index    : files [35] with total_size [1mb], took[0s]
             : recovered_files [0] with total_size [0b]
             : reusing_files [35] with total_size [1mb]
    start    : took [56ms], check_index [0s]
    translog : number_of_operations [2], took [2s]
[2012-05-27 12:34:48,817][INFO ][cluster.service ] [es1-01] added {[es1-04][hK0kWxmXT8yNjCcKqzcZIg][inet[/10.1.101.154:9300]]{data_center=ny4},}, reason: zen-disco-receive(from master [[es1-02][YHfuNBZNTIOz-XS2IDaH8A][inet[/10.1.101.152:9300]]{data_center=ny4}])

However, we were still stuck with 4 initializing shards; checking the
cluster state indicated that es1-02 and es1-05 were still initializing
those shards. The log on es1-05, which had one of the "initializing"
shards, showed many messages like these (some rarer than others):

[2012-05-28 05:21:39,113][WARN ][cluster.action.shard ] [es1-05] received shard failed for [6131251_l][6], node[b-CbP5mVRhubE-k600aotw], [P], s[STARTED], reason [master [es1-05][2oGhhbfxTVmakTROvKY3Xg][inet[/10.1.101.155:9300]]{data_center=ny4} marked shard as started, but shard have not been created, mark shard as failed]
[2012-05-28 05:21:42,109][WARN ][cluster.action.shard ] [es1-05] received shard failed for [6131251_l][6], node[b-CbP5mVRhubE-k600aotw], [P], s[STARTED], reason [master [es1-05][2oGhhbfxTVmakTROvKY3Xg][inet[/10.1.101.155:9300]]{data_center=ny4} marked shard as started, but shard have not been created, mark shard as failed]
[2012-05-28 05:21:43,202][DEBUG][cluster.service ] [es1-05] processing [shard-failed ([6131251_l][6], node[b-CbP5mVRhubE-k600aotw], [P], s[STARTED]), reason [master [es1-05][2oGhhbfxTVmakTROvKY3Xg][inet[/10.1.101.155:9300]]{data_center=ny4} marked shard as started, but shard have not been created, mark shard as failed]]: execute
[2012-05-28 05:21:43,202][DEBUG][cluster.action.shard ] [es1-05] Received failed shard [6131251_l][6], node[b-CbP5mVRhubE-k600aotw], [P], s[STARTED], reason [master [es1-05][2oGhhbfxTVmakTROvKY3Xg][inet[/10.1.101.155:9300]]{data_center=ny4} marked shard as started, but shard have not been created, mark shard as failed]
[2012-05-28 05:21:43,212][DEBUG][gateway.local ] [es1-05] [6131251_l][6]: allocating [[6131251_l][6], node[null], [P], s[UNASSIGNED]] to [[es1-aws-01][n1ZFyLC7QrmEl4SK5lO4ng][inet[/10.2.101.17:9300]]{data_center=us-east-1}] on primary allocation
[2012-05-28 05:21:43,385][DEBUG][cluster.action.shard ] [es1-05] Applying failed shard [6131251_l][6], node[b-CbP5mVRhubE-k600aotw], [P], s[STARTED], reason [master [es1-05][2oGhhbfxTVmakTROvKY3Xg][inet[/10.1.101.155:9300]]{data_center=ny4} marked shard as started, but shard have not been created, mark shard as failed]

[2012-05-28 05:24:12,215][DEBUG][cluster.service ] [es1-05] processing [shard-failed ([6131251_l][6], node[b-CbP5mVRhubE-k600aotw], [P], s[STARTED]), reason [master [es1-05][2oGhhbfxTVmakTROvKY3Xg][inet[/10.1.101.155:9300]]{data_center=ny4} marked shard as started, but shard have not been created, mark shard as failed]]: no change in cluster_state

[2012-05-28 05:25:02,721][DEBUG][cluster.service ] [es1-05] cluster state updated, version [6077], source [shard-started ([6131251_l][6], node[Yqlst_mwSZ205QAUPnOYBw], [R], s[INITIALIZING]), reason [after recovery (replica) from node [[es1-aws-01][n1ZFyLC7QrmEl4SK5lO4ng][inet[/10.2.101.17:9300]]{data_center=us-east-1}]]]
[2012-05-28 05:25:02,740][DEBUG][cluster.service ] [es1-05] processing [shard-started ([6131251_l][6], node[Yqlst_mwSZ205QAUPnOYBw], [R], s[INITIALIZING]), reason [after recovery (replica) from node [[es1-aws-01][n1ZFyLC7QrmEl4SK5lO4ng][inet[/10.2.101.17:9300]]{data_center=us-east-1}]]]: done applying updated cluster_state

And that's it; the log doesn't have any more references to this shard,
which remained "initializing" until the node was restarted.

-eran

Can you mail me directly the full logs of all the servers for the relevant
time? I will have a look...
