"Shard ... should exists, but doesn't" errors

We have a cluster with approximately 200m documents divided between 19
indexes. These are sharded so there are approximately 1m documents per
shard, with no replicas. The data nodes in the cluster are hosted on 8 x
m2.2xlarge Amazon EC2 instances.

One of the nodes appears to have had some kind of networking issue and
temporarily left the cluster. Once it rejoined, it reported a lot of "marked
shard as started, but shard have not been created, mark shard as failed"
errors before eventually settling down, leaving 4 unassigned shards, each
with the following error:

[2012-12-06 17:42:30,070][WARN ][indices.cluster ] [Richard Rider]
[docs-en-1][29] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException:
[docs-en-1][29] shard allocated for local recovery (post api), should
exists, but doesn't
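
To list which shards hit this error, the log can be scanned for the "failed to start shard" lines. This is only a minimal sketch, assuming the log format shown above; failed_shards is an illustrative helper name, not part of Elasticsearch.

```python
import re

# Matches "[index][shard] failed to start shard" as in the log line above.
# The node name in brackets is skipped because it is not immediately
# followed by a second bracketed number.
FAILED_SHARD = re.compile(r"\[([^\[\]]+)\]\[(\d+)\]\s+failed to start shard")

def failed_shards(log_text):
    """Return a sorted list of unique (index, shard) pairs that failed to start."""
    found = {(m.group(1), int(m.group(2))) for m in FAILED_SHARD.finditer(log_text)}
    return sorted(found)
```

Running it over the whole log file should give one entry per affected shard, however many times the warning was repeated.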

Folders existed on disk for these unassigned shards on the node, but they
had no data in them (du -sh reported a size of 0).
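
The same check can be scripted across all shard folders of an index. This is a rough sketch: find_empty_dirs and dir_size_bytes are illustrative names, and the directory passed in is whatever folder holds the per-shard subfolders on your node (the exact layout depends on path.data and the Elasticsearch version, so it is left as an assumption here).

```python
import os

def dir_size_bytes(path):
    """Sum the sizes of all regular files under path, recursively."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total

def find_empty_dirs(parent):
    """Return sorted names of immediate subfolders of parent with no file data."""
    empty = []
    for name in sorted(os.listdir(parent)):
        full = os.path.join(parent, name)
        if os.path.isdir(full) and dir_size_bytes(full) == 0:
            empty.append(name)
    return empty

# Example call (hypothetical path, adjust to your own data directory):
# print(find_empty_dirs("/path/to/indices/docs-en-1"))
```

Unlike du, this counts only file contents, so a folder containing nothing but empty subfolders is reported as empty.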

I don't think split-brain explains this, as we have 3 m1.small dataless
nodes configured to be masters, with discovery.zen.minimum_master_nodes: 2,
and there is a "not enough master nodes after master left" message in the
logs suggesting this setting is working correctly.
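
For reference, a value of 2 with 3 master-eligible nodes matches the usual majority rule for discovery.zen.minimum_master_nodes, floor(n / 2) + 1. A small sketch of that arithmetic follows; the function names are made up for illustration.

```python
def minimum_master_nodes(master_eligible):
    """Strict majority of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1

def can_elect_master(master_eligible, reachable):
    """True if the reachable master-eligible nodes form a majority."""
    return reachable >= minimum_master_nodes(master_eligible)

print(minimum_master_nodes(3))   # 2, the value used in this cluster
print(can_elect_master(3, 1))    # False: a lone master-eligible node cannot elect a master
```

With this rule, two disjoint groups of master-eligible nodes can never both hold a majority, which is why the "not enough master nodes" message is consistent with the setting doing its job.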

I would like to understand the cause of this issue, to prevent it happening
again. Also, if we had replicas, would Elasticsearch have recovered
correctly from this situation?

--

Not sure why these shards disappeared, but adding replicas
would definitely help in this situation.

Which version of Elasticsearch are you using?
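
On the replica suggestion: the replica count can be raised on a live index through the update settings API (PUT /{index}/_settings with index.number_of_replicas). A minimal sketch that only builds the request is below; the localhost:9200 address and the build_replica_request helper are assumptions for illustration, and the index name is taken from the log in the original post.

```python
import json
import urllib.request

def build_replica_request(base_url, index, replicas):
    """Build (but do not send) a PUT request setting number_of_replicas."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    url = "{}/{}/_settings".format(base_url.rstrip("/"), index)
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )

# Assumed address; replace with a node in your own cluster.
req = build_replica_request("http://localhost:9200", "docs-en-1", 1)
# urllib.request.urlopen(req)  # uncomment to actually send the request
```

With one replica per shard, a copy of each shard lives on a second node, at the cost of roughly doubling the disk space used.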

In reply to Robin Hughes (Tuesday, December 11, 2012 6:52:58 AM UTC-5)

--

Hi Igor,

Thanks for the reply. We're using v0.19.8.

In reply to Igor Motov (Thursday, December 13, 2012 9:21:08 PM UTC)

--