A temporary network disconnect of the master node caused a torrent of RELOCATING shards, and afterwards one shard remained UNASSIGNED, leaving the cluster state red.
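For reference, this is roughly how the problem shows up from the REST API; a minimal sketch in Python with requests, using a placeholder host (not my exact commands):

import requests

ES = "http://localhost:9200"  # placeholder; any node in the cluster

# Cluster health reports the red status and the unassigned shard count.
health = requests.get(ES + "/_cluster/health").json()
print(health["status"], health["unassigned_shards"])

# The cat shards API lists the per-shard state (STARTED/RELOCATING/UNASSIGNED).
print(requests.get(ES + "/_cat/shards?v").text)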
Looking inside the shard's directory on disk, I found that the index directory was empty (i.e., the _state and translog dirs were there, but the index dir had no files).
Looking at the log files, I can see that the disconnect happened around 11:42:05, and a few minutes later I started seeing these error messages:
[2014-09-10 11:45:33,341][WARN ][indices.cluster          ] [buzzilla_data008] [el-2011-10-31-0000][0] failed to start shard
[2014-09-10 11:45:33,342][WARN ][cluster.action.shard     ] [buzzilla_data008] [el-2011-10-31-0000][0] sending failed shard for [el-2011-10-31-0000][0], node[RAR26zfuTiKl4mdbRVTtNA], [P], s[INITIALIZING], indexUUID [na], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[el-2011-10-31-0000][0] failed to fetch index version after copying it over]; nested: IndexShardGatewayRecoveryException[[el-2011-10-31-0000][0] shard allocated for local recovery (post api), should exist, but doesn't, current files: []]; nested: IndexNotFoundException[no segments* file found in store(least_used[rate_limited(mmapfs(/home/omgili/data/elasticsearch/data/buzzilla/nodes/0/indices/el-2011-10-31-0000/0/index), type=MERGE, rate=20.0)]): files: []]; ]]
The relevant log files are at: Data loss on network disconnect · GitHub
data009 is the original master, data017 is the new master, and data008 is where I found the empty index directory.
I had to delete the unassigned index from the cluster to return it to a green state.
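Concretely, the cleanup amounted to something like this (again a sketch with a placeholder host):

import requests

ES = "http://localhost:9200"  # placeholder

# Drop the index whose primary shard could not be recovered...
requests.delete(ES + "/el-2011-10-31-0000")

# ...then wait for the cluster to report green again.
health = requests.get(ES + "/_cluster/health?wait_for_status=green&timeout=60s").json()
print(health["status"])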
I am running Elasticsearch 1.2.1 in a 20-node cluster.
How does this happen? What can I do to prevent this from happening again?
How were these nodes doing in terms of available heap space before the
disconnects occurred?
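Something like the following (a rough sketch, placeholder host) would show the per-node heap numbers I'm asking about:

import requests

ES = "http://localhost:9200"  # placeholder; any node in the cluster

# Nodes stats, restricted to JVM metrics, include heap usage per node.
stats = requests.get(ES + "/_nodes/stats/jvm").json()
for node_id, node in stats["nodes"].items():
    mem = node["jvm"]["mem"]
    print(node["name"], str(mem["heap_used_percent"]) + "% of",
          mem["heap_max_in_bytes"] // (1024 * 1024), "MB heap")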