Data loss after network disconnect

A temporary network disconnect of the master node caused a torrent of
RELOCATING shards, and then one shard remained UNASSIGNED and the cluster
state was left red.

Looking inside the shard's index directory on disk, I found that it was
empty (i.e., the _state and translog dirs were there, but the index dir
had no files).
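
For anyone who wants to check their own nodes for the same symptom, here is a minimal sketch in Python. It assumes the on-disk layout visible in the log path below (indices/<index name>/<shard number>/index) and treats a shard as suspect when its index dir contains no segments* file, which is the condition the exception complains about. The function name and the path in the usage note are hypothetical, not part of Elasticsearch.

```python
from pathlib import Path


def shards_missing_segments(indices_dir):
    """Return shard directories whose 'index' subdirectory has no segments* file.

    Assumes the layout <indices_dir>/<index name>/<shard number>/index,
    as seen in the log message; adjust the glob if your layout differs.
    """
    missing = []
    for index_dir in sorted(Path(indices_dir).glob("*/*/index")):
        if not index_dir.is_dir():
            continue
        # A Lucene commit is recorded in a segments_N file; none means no usable index.
        if not any(index_dir.glob("segments*")):
            missing.append(index_dir.parent)
    return missing
```

Calling shards_missing_segments("/path/to/data/<cluster>/nodes/0/indices") on a data node returns the list of shard directories that look empty. It only reads directory listings, so it should be safe to run against a live node, but it says nothing about whether another copy of the shard exists elsewhere in the cluster.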

Looking at the log files, I see that the disconnect happened around
11:42:05, and a few minutes later I started seeing these error messages:

[2014-09-10 11:45:33,341][WARN ][indices.cluster ] [buzzilla_data008] [el-2011-10-31-0000][0] failed to start shard
[2014-09-10 11:45:33,342][WARN ][cluster.action.shard ] [buzzilla_data008] [el-2011-10-31-0000][0] sending failed shard for [el-2011-10-31-0000][0], node[RAR26zfuTiKl4mdbRVTtNA], [P], s[INITIALIZING], indexUUID [na], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[el-2011-10-31-0000][0] failed to fetch index version after copying it over]; nested: IndexShardGatewayRecoveryException[[el-2011-10-31-0000][0] shard allocated for local recovery (post api), should exist, but doesn't, current files: []]; nested: IndexNotFoundException[no segments* file found in store(least_used[rate_limited(mmapfs(/home/omgili/data/elasticsearch/data/buzzilla/nodes/0/indices/el-2011-10-31-0000/0/index), type=MERGE, rate=20.0)]): files: []]; ]]

The relevant log files are at
https://gist.github.com/itsadok/97453743d6b211681aca

data009 is the original master, data017 is the new master, and data008 is
where I found the empty index directory.

I had to delete the index with the unassigned shard from the cluster to
return it to a green state.
I am running Elasticsearch 1.2.1 in a 20-node cluster.

How does this happen? What can I do to prevent this from happening again?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CADdQPqz%3DXZDMbHC7zpWgEdaqW4Xy_VkX7EgRwfXsrJjuoQ50SA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

How were these nodes doing in terms of available heap space before the
disconnects occurred?

On Wednesday, September 10, 2014 6:26:19 AM UTC-4, Israel Tsadok wrote:

The relevant log files are at
https://gist.github.com/itsadok/97453743d6b211681aca
