Trouble restarting after crash


(Chuck McKenzie) #1

I've inherited several large elasticsearch 0.18.7 clusters and I'm
having some trouble getting one of them to restart after a crash. (We
ran out of open file handles.) I've since upped the limit, and I'll be
cleaning up old indices after it comes back up, so that shouldn't
happen again, but I can't get the cluster to finish starting.

Here's the problem I'm seeing:

[2012-05-07 13:11:04,926][WARN ][indices.cluster ] [node_name] [shard_name][8] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [shard_name][8] shard allocated for local recovery (post api), should exists, but doesn't
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:99)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:179)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

I've followed earlier instructions on this mailing list that say to
XDELETE the affected index, but that doesn't seem to be working - it's
been sitting for an hour as follows:

{
  "cluster_name" : "es_cluster1",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 4,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 5260,
  "active_shards" : 9719,
  "relocating_shards" : 0,
  "initializing_shards" : 6,
  "unassigned_shards" : 41
}

Any idea how I can get rid of the two tiny test indices that are
having problems, without deleting several TB of data from the other
indices?
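
For reference, here's roughly what I've been running - the health
check that produced the output above, plus the delete I tried against
one of the broken indices (test_index_1 is a placeholder, not the real
index name):

curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
curl -XDELETE 'http://localhost:9200/test_index_1'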


(Rafał Kuć) #2

Hello!

We had a similar issue to yours - did you try running the XDELETE on
more than one node? We had to run the XDELETE on two nodes in the
cluster to actually get the problematic indices deleted.
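
For example, something along these lines, hitting each node directly
(the hostnames and the index name are placeholders for your own):

curl -XDELETE 'http://es-node1:9200/broken_test_index'
curl -XDELETE 'http://es-node2:9200/broken_test_index'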

--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch



(Chuck McKenzie) #3

The XDELETEs are running against localhost on each of the 4 nodes.
Doesn't seem to help.
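
After the deletes I'm checking whether the index still shows up,
roughly like this (broken_test_index is a placeholder for one of the
two problem indices):

curl -XGET 'http://localhost:9200/broken_test_index/_status?pretty=true'
curl -XGET 'http://localhost:9200/_cluster/health/broken_test_index?pretty=true'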



(Shay Banon) #4

DELETEing the index will help remove this message. This problem should
be fixed in 0.19 with the new local gateway structure and several bug
fixes (including the case where a shard can't recover).
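
As a side note, you can confirm which version each node is actually
running by hitting the root endpoint, something like:

curl -XGET 'http://localhost:9200/?pretty=true'

which includes the version number in its response, so you can tell
whether a node is still on a pre-0.19 release.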


