Unable to recover cluster after OOM

phobos182 · December 1, 2011, 6:24am

Getting a ton of these in my log files. Cluster crashed on an OOM event.

de[2JqaayEqTfSwxkGSwicsfg], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[test_20110601-20110630_2011_22][4] shard allocated for local recovery (post api), should exists, but doesn't]]]
[2011-12-01 00:22:02,564][WARN ][cluster.action.shard ] [Arcade] sending failed shard for [test_20110601-20110630_2011_24][14], node[2JqaayEqTfSwxkGSwicsfg], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[test_20110601-20110630_2011_24][14] shard allocated for local recovery (post api), should exists, but doesn't]]]

Now I'm trying to just delete that index and move on, since it's a test index, but the cluster is not letting the delete command go through.

curl -XDELETE 'http://es1.colo:9200/test_20110601-20110630_2011_24'

After I press enter, the terminal just hangs and the command never completes. How do I delete this index and recover my cluster? It's currently RED.
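
For anyone hitting the same hang: a minimal sketch of sending the same DELETE with a client-side time limit, so the terminal returns even if the cluster never answers. The helper names `index_url` and `delete_index` and the 30-second default are assumptions for illustration, not part of the original post. `--max-time` is a standard curl option, and it only bounds how long the client waits; it does not make the cluster process the delete.

```shell
#!/bin/sh
# Build the index URL from a host and an index name (hypothetical helper;
# port 9200 is taken from the command in the post above).
index_url() {
  printf 'http://%s:9200/%s' "$1" "$2"
}

# Send the DELETE, but give up after a fixed number of seconds instead of
# hanging forever. The third argument is the limit in seconds (default 30,
# an arbitrary choice).
delete_index() {
  curl --silent --show-error --max-time "${3:-30}" \
    -XDELETE "$(index_url "$1" "$2")"
}

# Usage (not executed here):
#   delete_index es1.colo test_20110601-20110630_2011_24 30 \
#     || echo "delete failed or timed out (curl exit code $?)"
```

If the request times out, curl exits with a non-zero status, so the call can be wrapped in a retry loop or simply reported.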

phobos182 · December 1, 2011, 6:38am

Forgot to include my cluster health.

{
cluster_name: es_cluster1
status: red
timed_out: false
number_of_nodes: 20
number_of_data_nodes: 20
active_primary_shards: 959
active_shards: 959
relocating_shards: 0
initializing_shards: 52
unassigned_shards: 29
}
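
As a quick reading of the numbers above, here is a small sketch that adds up the shard counts. The function name `shard_summary` is made up for illustration. It only does arithmetic on the three counters, and the health output alone does not say how many of the shards that have not started are primaries.

```shell
#!/bin/sh
# Sum the shard counters from a cluster health response.
# Arguments: active_shards initializing_shards unassigned_shards
shard_summary() {
  total=$(( $1 + $2 + $3 ))      # every shard the cluster knows about
  not_started=$(( $2 + $3 ))     # shards that are not yet serving requests
  echo "total=$total not_started=$not_started"
}

# Values copied from the health output above.
shard_summary 959 52 29
# prints: total=1040 not_started=81
```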

phobos182 · December 1, 2011, 3:31pm

Any advice would be appreciated. I can turn on debug level logging to see if there are any error messages on the cluster, if need be.

kimchy · December 1, 2011, 4:09pm

Which version are you using? The delete API should eventually succeed and delete the index; it's strange that you can't delete it. Can you give it another go?

Also, do you have a stack trace of the OOM from the logs? Can you gist it?

phobos182 · December 1, 2011, 4:50pm

No problem. Using 0.18.3. I do not have the OOM log anymore (it was caused by a very, very bad facet query that took all of the RAM on the cluster).

Going to turn up Debug level logging everywhere, and try the DELETE command.

phobos182 · December 1, 2011, 7:29pm

Just ran TRACE level logging, and sent the delete command. Nothing shows up in the logs.

I have a loop running over and over again putting this message in the logs.

[2011-12-01 13:28:21,935][DEBUG][action.admin.indices.status] [Mockingbird] [test_20110601-20110630_2011_26][2], node[LgkDg8hHR-OLiB9u3y4gkg], [P], s[INITIALIZING]: Failed to execute [org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@6d3d2b91]
org.elasticsearch.transport.RemoteTransportException: [Blink][inet[/192.168.200.110:9300]][indices/status/shard]
Caused by: org.elasticsearch.index.IndexShardMissingException: [test_20110601-20110630_2011_26][2] missing
at org.elasticsearch.index.service.InternalIndexService.shardSafe(InternalIndexService.java:177)
at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:135)
at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:58)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:382)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:371)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:246)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2011-12-01 13:28:21,939][DEBUG][action.admin.indices.status] [Mockingbird] [test_20110601-20110630_2011_25][11], node[wQCvDYuOQXi8onkrGH71iw], [P], s[INITIALIZING]: Failed to execute [org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@6d3d2b91]
org.elasticsearch.transport.RemoteTransportException: [Margali Szardos][inet[/192.168.200.177:9300]][indices/status/shard]
Caused by: org.elasticsearch.index.IndexShardMissingException: [test_20110601-20110630_2011_25][11] missing
at org.elasticsearch.index.service.InternalIndexService.shardSafe(InternalIndexService.java:177)
at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:135)
at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:58)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:382)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:371)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:246)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

kimchy · December 1, 2011, 7:42pm

Are you calling the indices status API yourself? Elasticsearch doesn't call it on its own, and you can ignore the failure.

phobos182 · December 2, 2011, 3:59pm

Putting this down for now. I had to get the cluster back up and running, so I erased the entire thing and started reindexing. When I have time I'll try to re-create it with a development cluster.

I'm thinking I can create a few indexes, stop the cluster behind the scenes, rm -rf a shard, and see if I can recreate the bug.
