A temporary network disconnect of the master node caused a torrent of RELOCATING shards, and afterwards one shard remained UNASSIGNED, leaving the cluster state red.
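For reference, this is roughly how the problem shows up from the REST API; a minimal sketch in Python with requests, using a placeholder host (not my exact commands):

import requests

ES = "http://localhost:9200"  # placeholder; any node in the cluster

# Cluster health reports the red status and the unassigned shard count.
health = requests.get(ES + "/_cluster/health").json()
print(health["status"], health["unassigned_shards"])

# The cat shards API lists the per-shard state (STARTED/RELOCATING/UNASSIGNED).
print(requests.get(ES + "/_cat/shards?v").text)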
Looking inside the shard's directory on disk, I found that the index directory was empty (i.e., the _state and translog dirs were there, but the index dir had no files).
Looking at the log files, I can see that the disconnect happened around 11:42:05, and a few minutes later I started seeing these error messages:
[2014-09-10 11:45:33,341][WARN ][indices.cluster          ] [buzzilla_data008] [el-2011-10-31-0000][0] failed to start shard
[2014-09-10 11:45:33,342][WARN ][cluster.action.shard     ] [buzzilla_data008] [el-2011-10-31-0000][0] sending failed shard for [el-2011-10-31-0000][0], node[RAR26zfuTiKl4mdbRVTtNA], [P], s[INITIALIZING], indexUUID [na], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[el-2011-10-31-0000][0] failed to fetch index version after copying it over]; nested: IndexShardGatewayRecoveryException[[el-2011-10-31-0000][0] shard allocated for local recovery (post api), should exist, but doesn't, current files: []]; nested: IndexNotFoundException[no segments* file found in store(least_used[rate_limited(mmapfs(/home/omgili/data/elasticsearch/data/buzzilla/nodes/0/indices/el-2011-10-31-0000/0/index), type=MERGE, rate=20.0)]): files: []]; ]]
The relevant log files are at: Data loss on network disconnect · GitHub
data009 is the original master, data017 is the new master, and data008 is where I found the empty index directory.
I had to delete the unassigned index from the cluster to return it to a green state.
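Concretely, the cleanup amounted to something like this (again a sketch with a placeholder host):

import requests

ES = "http://localhost:9200"  # placeholder

# Drop the index whose primary shard could not be recovered...
requests.delete(ES + "/el-2011-10-31-0000")

# ...then wait for the cluster to report green again.
health = requests.get(ES + "/_cluster/health?wait_for_status=green&timeout=60s").json()
print(health["status"])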
I am running Elasticsearch 1.2.1 in a 20-node cluster.
How does this happen? What can I do to prevent this from happening again?
How were these nodes doing in terms of available heap space before the
disconnects occurred?
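Something like the following (a rough sketch, placeholder host) would show the per-node heap numbers I'm asking about:

import requests

ES = "http://localhost:9200"  # placeholder; any node in the cluster

# Nodes stats, restricted to JVM metrics, include heap usage per node.
stats = requests.get(ES + "/_nodes/stats/jvm").json()
for node_id, node in stats["nodes"].items():
    mem = node["jvm"]["mem"]
    print(node["name"], str(mem["heap_used_percent"]) + "% of",
          mem["heap_max_in_bytes"] // (1024 * 1024), "MB heap")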