Trouble restarting after crash

I've inherited several large 0.18.7 elasticsearch clusters and I'm
having some trouble getting one of them to restart after a crash. (We ran
out of open file handles.) I've since upped the limit, and I'll be
cleaning up old indices after it comes back up, so that shouldn't
happen again, but I can't get the cluster to finish starting.
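
For reference, the limit was bumped along these lines (the user name and value here are just an example, not necessarily our exact config):

# /etc/security/limits.conf - raise the open file limit for the user that runs elasticsearch
elasticsearch  soft  nofile  65535
elasticsearch  hard  nofile  65535

# or just for the current shell, before starting the node:
ulimit -n 65535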

Here's the problem I'm seeing:

[2012-05-07 13:11:04,926][WARN ][indices.cluster ] [node_name] [shard_name][8] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [shard_name][8] shard allocated for local recovery (post api), should exists, but doesn't
        at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:99)
        at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:179)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

I've followed earlier instructions on this mailing list that say to
XDELETE the affected index, but that doesn't seem to be working - it's
been sitting for an hour as follows:

{
  "cluster_name" : "es_cluster1",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 4,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 5260,
  "active_shards" : 9719,
  "relocating_shards" : 0,
  "initializing_shards" : 6,
  "unassigned_shards" : 41
}
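
(That's the output of the cluster health API - I'm checking it with something like this, assuming the default HTTP port:)

curl 'http://localhost:9200/_cluster/health?pretty=true'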

Any idea how I can get rid of the two tiny test indices that are
having problems, without deleting several TB of data from the other
indices?

Hello!

We had a similar issue to yours - did you try running the XDELETE on more
than one node? We had to run the XDELETE on two nodes in the cluster
before the problematic indices were actually deleted.
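
Something along these lines on each node should do it (the index name is just a placeholder, adjust host/port to your setup):

# delete the problem index, run against the local node
curl -XDELETE 'http://localhost:9200/problem_index_name'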

--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Elasticsearch


The XDELETEs are running against localhost on each of the 4 nodes. Doesn't
seem to help.


DELETE-ing the index will make this message go away; the underlying problem
should be fixed in 0.19 with the new local gateway structure and several
bug fixes (including the case where a shard can't recover).
