Unable to recover cluster after OOM


(phobos182) #1

Getting a ton of these in my log files. Cluster crashed on an OOM event.

de[2JqaayEqTfSwxkGSwicsfg], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[test_20110601-20110630_2011_22][4] shard allocated for local recovery (post api), should exists, but doesn't]]]
[2011-12-01 00:22:02,564][WARN ][cluster.action.shard ] [Arcade] sending failed shard for [test_20110601-20110630_2011_24][14], node[2JqaayEqTfSwxkGSwicsfg], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[test_20110601-20110630_2011_24][14] shard allocated for local recovery (post api), should exists, but doesn't]]]

Now I'm trying to just delete that index and move on, since it's a test index, but the cluster is not letting the delete command go through.

curl -XDELETE 'http://es1.colo:9200/test_20110601-20110630_2011_24'

After pressing enter with this command, the terminal just hangs and the command does not go through. How do I delete this index and recover my cluster? It's currently RED.
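
For reference, one way to keep the client from hanging while the delete is pending is to put a timeout on the request itself. This is a minimal Python sketch using only the standard library. The helper names and the 30-second timeout are assumptions made for illustration; the host and index are the ones from the command above.

```python
import urllib.error
import urllib.request


def build_delete_request(base_url: str, index: str) -> urllib.request.Request:
    """Build (but do not send) an HTTP DELETE request for one index."""
    url = f"{base_url.rstrip('/')}/{index}"
    return urllib.request.Request(url, method="DELETE")


def delete_index(base_url: str, index: str, timeout_s: float = 30.0) -> str:
    """Send the delete and return the response body, or a short note on failure.

    timeout_s is a client-side limit only: when it expires we stop waiting,
    but the cluster may still be working on the delete.
    """
    request = build_delete_request(base_url, index)
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as exc:
        # The node answered, but with an error status.
        return f"HTTP {exc.code}: {exc.reason}"
    except OSError as exc:
        # Covers connection failures and socket timeouts.
        return f"gave up: {exc}"
```

Calling `delete_index("http://es1.colo:9200", "test_20110601-20110630_2011_24")` would then return within roughly the timeout instead of blocking the terminal. It does not make the delete itself succeed.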


(phobos182) #2

Forgot to include my cluster health.

{
cluster_name: es_cluster1
status: red
timed_out: false
number_of_nodes: 20
number_of_data_nodes: 20
active_primary_shards: 959
active_shards: 959
relocating_shards: 0
initializing_shards: 52
unassigned_shards: 29
}
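
To put those numbers in one place, here is a small Python sketch that totals the shards that have not started. The values are copied from the output above; `summarize_health` is a hypothetical helper, not part of Elasticsearch.

```python
def summarize_health(health: dict) -> dict:
    """Summarize shard counts from a cluster health response."""
    # Shards that are neither started nor relocating.
    pending = health["initializing_shards"] + health["unassigned_shards"]
    # Active shards that are not primaries, i.e. started replicas.
    active_replicas = health["active_shards"] - health["active_primary_shards"]
    return {
        "status": health["status"],
        "active": health["active_shards"],
        "pending": pending,
        "active_replicas": active_replicas,
    }


# Values copied by hand from the health output in this post.
health = {
    "cluster_name": "es_cluster1",
    "status": "red",
    "timed_out": False,
    "number_of_nodes": 20,
    "number_of_data_nodes": 20,
    "active_primary_shards": 959,
    "active_shards": 959,
    "relocating_shards": 0,
    "initializing_shards": 52,
    "unassigned_shards": 29,
}

summary = summarize_health(health)
print(summary)
```

With these figures, 81 shards (52 initializing plus 29 unassigned) are not started, and no replica copies are counted as active.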


(phobos182) #3

Any advice would be appreciated. I can turn on debug-level logging to see if there are any error messages on the cluster, if need be.


(Shay Banon) #4

Which version are you using? The delete API should eventually succeed and
delete the index; it's strange that you can't delete it. Can you give it
another go?

Also, do you have a stack trace of the OOM from the logs? Can you gist it?



(phobos182) #5

No problem. Using 18.3. I do not have the OOM log anymore (It was caused by a very, very bad facet query that took all of the RAM on the cluster).

Going to turn up Debug level logging everywhere, and try the DELETE command.


(phobos182) #6

Just ran TRACE level logging, and sent the delete command. Nothing shows up in the logs.

I have a loop running over and over again putting this message in the logs.

[2011-12-01 13:28:21,935][DEBUG][action.admin.indices.status] [Mockingbird] [test_20110601-20110630_2011_26][2], node[LgkDg8hHR-OLiB9u3y4gkg], [P], s[INITIALIZING]: Failed to execute [org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@6d3d2b91]
org.elasticsearch.transport.RemoteTransportException: [Blink][inet[/192.168.200.110:9300]][indices/status/shard]
Caused by: org.elasticsearch.index.IndexShardMissingException: [test_20110601-20110630_2011_26][2] missing
at org.elasticsearch.index.service.InternalIndexService.shardSafe(InternalIndexService.java:177)
at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:135)
at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:58)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:382)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:371)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:246)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2011-12-01 13:28:21,939][DEBUG][action.admin.indices.status] [Mockingbird] [test_20110601-20110630_2011_25][11], node[wQCvDYuOQXi8onkrGH71iw], [P], s[INITIALIZING]: Failed to execute [org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@6d3d2b91]
org.elasticsearch.transport.RemoteTransportException: [Margali Szardos][inet[/192.168.200.177:9300]][indices/status/shard]
Caused by: org.elasticsearch.index.IndexShardMissingException: [test_20110601-20110630_2011_25][11] missing
at org.elasticsearch.index.service.InternalIndexService.shardSafe(InternalIndexService.java:177)
at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:135)
at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:58)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:382)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:371)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:246)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
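
With many of these in the logs, it can help to reduce them to a list of affected shards. Below is a rough Python sketch. The regular expression is written against the two exception lines above and is an assumption about the log format, not something Elasticsearch provides.

```python
import re
from collections import Counter

# Matches lines such as:
#   ...IndexShardMissingException: [some_index][2] missing
MISSING_SHARD = re.compile(
    r"IndexShardMissingException: \[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\] missing"
)


def missing_shards(log_text: str) -> Counter:
    """Count IndexShardMissingException occurrences per (index, shard)."""
    counts = Counter()
    for match in MISSING_SHARD.finditer(log_text):
        counts[(match.group("index"), int(match.group("shard")))] += 1
    return counts


# The two "Caused by" lines from the log excerpt in this post.
sample = """\
Caused by: org.elasticsearch.index.IndexShardMissingException: [test_20110601-20110630_2011_26][2] missing
Caused by: org.elasticsearch.index.IndexShardMissingException: [test_20110601-20110630_2011_25][11] missing
"""

for (index, shard), count in sorted(missing_shards(sample).items()):
    print(index, shard, count)
```

Run over a whole log file, this gives a per-shard count that can be compared against the initializing and unassigned totals from cluster health.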


(Shay Banon) #7

Are you calling the indices status API yourself? Elasticsearch doesn't call
it on its own, and you can ignore that failure.



(phobos182) #8

Putting this down for now. I had to get the cluster back up and running, so I erased the entire thing and started reindexing. When I have time I'll try to re-create it with a development cluster.

I'm thinking I can create a few indexes, stop the cluster behind the scenes, rm -rf a shard, and see if I can recreate the bug.
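
The "rm -rf a shard" step could be scripted along these lines. This is only a sketch for a stopped development node: it assumes shard data lives under `<indices_dir>/<index>/<shard number>`, which should be checked against the actual data directory first. `remove_shard_directory` is a hypothetical helper, and the demo runs against a temporary directory rather than real Elasticsearch data.

```python
import shutil
import tempfile
from pathlib import Path


def remove_shard_directory(indices_dir: Path, index: str, shard: int) -> Path:
    """Delete one shard directory and return the path that was removed.

    Assumes the layout <indices_dir>/<index>/<shard>; refuses to do
    anything if that directory does not exist.
    """
    shard_dir = indices_dir / index / str(shard)
    if not shard_dir.is_dir():
        raise FileNotFoundError(f"no shard directory at {shard_dir}")
    shutil.rmtree(shard_dir)
    return shard_dir


# Demo on a throwaway directory tree that mimics the assumed layout.
demo_root = Path(tempfile.mkdtemp())
(demo_root / "test_index" / "0").mkdir(parents=True)
(demo_root / "test_index" / "1").mkdir(parents=True)

removed = remove_shard_directory(demo_root, "test_index", 0)
remaining = sorted(p.name for p in (demo_root / "test_index").iterdir())
print(remaining)
```

After restarting the node, the cluster health and logs would show whether the missing shard reproduces the recovery failure.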

