This morning I arrived to find our test ES cluster in poor health...
Some background on our cluster...
A single ES index running on a single AWS High I/O instance (60 GB RAM, 2 TB of SSD
storage, 16 virtual cores). The index has 8 shards, contains approximately 12
million documents, and is approximately 450 GB in size. We're running
ElasticSearch 0.20.5 on Java OpenJDK 1.6.0_24 (I know I'd be better off
with Oracle 1.7; I just haven't had the time to update our install scripts).
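For reference, this is roughly how I've been polling cluster and index state while
debugging. It's nothing special, just the stock cluster health and indices status
APIs over HTTP; localhost:9200 is the default, and the bare-bones client below is
only a sketch:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ClusterCheck {

    // Minimal GET against the ES REST API; assumes the default host/port.
    static String get(String path) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://localhost:9200" + path).openConnection();
        conn.setRequestMethod("GET");
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        StringBuilder body = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            body.append(line).append('\n');
        }
        in.close();
        return body.toString();
    }

    public static void main(String[] args) throws Exception {
        // Overall cluster state (green/yellow/red) with per-shard detail.
        System.out.println(get("/_cluster/health?level=shards&pretty=true"));
        // Per-shard status of the index (the same call the failing IndicesStatusRequest makes).
        System.out.println(get("/sdxcr/_status?pretty=true"));
    }
}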
Yesterday afternoon we ran some load tests against the index; they completed
successfully, and the cluster was in good health when I left for the day. When I
returned this morning, I was unable to run any searches, and after some digging
I saw that the cluster was having problems. At about the time of my first search
of the morning, ES started writing log entries for the index as if it had just
restarted. Note that there were no log entries indicating problems overnight
after the load test ended. This morning, ES appeared to be thrashing over the
various shards, trying to recover them. The start of today's log file:
[2013-05-07 07:37:43,798][INFO ][node ] [ip-10-148-148-93.ec2.internal] {0.20.5}[2261]: initializing ...
[2013-05-07 07:37:43,906][INFO ][plugins ] [ip-10-148-148-93.ec2.internal] loaded [cloud-aws], sites [bigdesk, head, paramedic]
[2013-05-07 07:37:47,209][INFO ][node ] [ip-10-148-148-93.ec2.internal] {0.20.5}[2261]: initialized
[2013-05-07 07:37:47,209][INFO ][node ] [ip-10-148-148-93.ec2.internal] {0.20.5}[2261]: starting ...
[2013-05-07 07:37:47,612][INFO ][transport ] [ip-10-148-148-93.ec2.internal] bound_address {inet[/0.0.0.0:9300]}, publish_address {inet[/X.X.X.X:9300]}
[2013-05-07 07:37:51,739][INFO ][cluster.service ] [ip-10-148-148-93.ec2.internal] new_master [ip-10-148-148-93.ec2.internal][t5V-xrEnTp2FtCJxBJ5gPg][inet[/X.X.X.X:9300]]{max_local_storage_nodes=1}, reason: zen-disco-join (elected_as_master)
[2013-05-07 07:37:51,774][INFO ][discovery ] [ip-10-148-148-93.ec2.internal] sdxcr/t5V-xrEnTp2FtCJxBJ5gPg
[2013-05-07 07:37:51,837][INFO ][http ] [ip-10-148-148-93.ec2.internal] bound_address {inet[/0.0.0.0:9200]}, publish_address {inet[/X.X.X.X:9200]}
[2013-05-07 07:37:51,838][INFO ][node ] [ip-10-148-148-93.ec2.internal] {0.20.5}[2261]: started
[2013-05-07 07:37:51,905][DEBUG][action.admin.indices.status] [ip-X.X.X.X.ec2.internal] [sdxcr][0], node[t5V-xrEnTp2FtCJxBJ5gPg], [P], s[INITIALIZING]: Failed to execute [org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@305f387c]
org.elasticsearch.indices.IndexMissingException: [sdxcr] missing
        at org.elasticsearch.indices.InternalIndicesService.indexServiceSafe(InternalIndicesService.java:244)
        at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:152)
        at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:59)
        at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:234)
        at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:211)
        at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction$1.run(TransportBroadcastOperationAction.java:187)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:679)
[2013-05-07 07:37:51,930][DEBUG][action.admin.indices.status] [ip-X.X.X.X.ec2.internal] [sdxcr][2], node[t5V-xrEnTp2FtCJxBJ5gPg], [P], s[INITIALIZING]: Failed to execute [org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@305f387c]
org.elasticsearch.indices.IndexMissingException: [sdxcr] missing
        at org.elasticsearch.indices.InternalIndicesService.indexServiceSafe(InternalIndicesService.java:244)
        at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:152)
        at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:59)
        at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:234)
        at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:211)
        at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction$1.run(TransportBroadcastOperationAction.java:187)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:679)
...
eventually leading to repeating errors like these for the various shards:
[2013-05-07 07:37:52,958][DEBUG][action.admin.indices.status] [ip-X.X.X.X.ec2.internal] [sdxcr][7], node[t5V-xrEnTp2FtCJxBJ5gPg], [P], s[INITIALIZING]: Failed to execute [org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@6446154e]
org.elasticsearch.index.IndexShardMissingException: [sdxcr][7] missing
        at org.elasticsearch.index.service.InternalIndexService.shardSafe(InternalIndexService.java:179)
        at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:153)
        at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:59)
        at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:234)
        at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:211)
        at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction$1.run(TransportBroadcastOperationAction.java:187)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:679)
[2013-05-07 07:37:53,021][WARN ][indices.cluster ] [ip-X.X.X.X.ec2.internal] [sdxcr][2] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [sdxcr][2] failed recovery
        at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:228)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:679)
Caused by: org.elasticsearch.index.engine.EngineCreationFailureException: [sdxcr][2] failed to create engine
        at org.elasticsearch.index.engine.robin.RobinEngine.start(RobinEngine.java:252)
        at org.elasticsearch.index.shard.service.InternalIndexShard.start(InternalIndexShard.java:279)
        at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:182)
        at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:177)
        ... 3 more
Caused by: java.io.FileNotFoundException: _5d.fnm
        at org.elasticsearch.index.store.Store$StoreDirectory.openInput(Store.java:519)
        at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:71)
        at org.apache.lucene.index.IndexWriter.getFieldInfos(IndexWriter.java:1212)
        at org.apache.lucene.index.IndexWriter.getCurrentFieldInfos(IndexWriter.java:1228)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1161)
        at org.apache.lucene.index.XIndexWriter.<init>(XIndexWriter.java:17)
        at org.elasticsearch.index.engine.robin.RobinEngine.createWriter(RobinEngine.java:1365)
        at org.elasticsearch.index.engine.robin.RobinEngine.start(RobinEngine.java:250)
        ... 6 more
This just goes on and on: it fails to recover one shard and then tries to
recover a different one.
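For what it's worth, the missing _5d.fnm looks to me like a corrupted or incomplete
Lucene segment in shard 2. The next thing I'm considering is pointing Lucene's
CheckIndex at that shard's index directory (on a copy of the data, with ES stopped)
to see what it reports about the segments. Something like the sketch below, where
the shard path argument is my guess at the default local-gateway layout
(<path.data>/<cluster-name>/nodes/0/indices/sdxcr/2/index):

import java.io.File;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.FSDirectory;

public class InspectShard {
    public static void main(String[] args) throws Exception {
        // args[0]: path to the shard's Lucene index directory (see guess above).
        File shardDir = new File(args[0]);

        FSDirectory dir = FSDirectory.open(shardDir);
        try {
            CheckIndex checker = new CheckIndex(dir);
            checker.setInfoStream(System.out);        // print per-segment details as it checks
            CheckIndex.Status status = checker.checkIndex();
            System.out.println("index clean: " + status.clean);
            // checker.fixIndex(status);  // destructive: drops broken segments; only ever run on a copy
        } finally {
            dir.close();
        }
    }
}

(CheckIndex also has its own main(), so the same check can be run straight from the
command line with the lucene-core jar on the classpath and the shard's index
directory as the argument.)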
I'm curious whether anyone else has experienced a similar cluster failure and
has any suggestions on how to prevent it in the future.
Thanks for any insights.