Cluster health degraded overnight for no apparent reason

This morning I arrived to find our test ES cluster in poor health...

Some background on our cluster...

Single ES index running on a single AWS Hi I/O instance (60 GB RAM, 2 TB SSD
storage, 16 virtual cores). The index has 8 shards, contains approximately 12
million documents, and is approximately 450 GB in size. Running
Elasticsearch 0.20.5 on Java OpenJDK 1.6.0_24 (I know I'd be better off
with Oracle 1.7; I just haven't had the time to update our install scripts).

Yesterday afternoon we ran some load tests against the index; they completed
successfully and the cluster was in good health when I left for the day.
When I returned this morning, I was unable to run any searches, and after
some digging I saw that the cluster was having problems. At about the time
of my first search of the morning, ES started putting out log entries for
the index as if it had just restarted. Note that there were no log entries
indicating problems overnight after the load test ended. This morning, ES
appeared to be thrashing over the various shards trying to recover them.
The start of today's log file:

[2013-05-07 07:37:43,798][INFO ][node ]
[ip-10-148-148-93.ec2.internal] {0.20.5}[2261]: initializing ...
[2013-05-07 07:37:43,906][INFO ][plugins ]
[ip-10-148-148-93.ec2.internal] loaded [cloud-aws], sites [bigdesk, head,
paramedic]
[2013-05-07 07:37:47,209][INFO ][node ]
[ip-10-148-148-93.ec2.internal] {0.20.5}[2261]: initialized
[2013-05-07 07:37:47,209][INFO ][node ]
[ip-10-148-148-93.ec2.internal] {0.20.5}[2261]: starting ...
[2013-05-07 07:37:47,612][INFO ][transport ]
[ip-10-148-148-93.ec2.internal] bound_address {inet[/0.0.0.0:9300]},
publish_address {inet[/X.X.X.X:9300]}
[2013-05-07 07:37:51,739][INFO ][cluster.service ]
[ip-10-148-148-93.ec2.internal] new_master
[ip-10-148-148-93.ec2.internal][t5V-xrEnTp2FtCJxBJ5gPg][inet[/X.X.X.X:9300]]{max_local_storage_nodes=1},
reason: zen-disco-join (elected_as_master)
[2013-05-07 07:37:51,774][INFO ][discovery ]
[ip-10-148-148-93.ec2.internal] sdxcr/t5V-xrEnTp2FtCJxBJ5gPg
[2013-05-07 07:37:51,837][INFO ][http ]
[ip-10-148-148-93.ec2.internal] bound_address {inet[/0.0.0.0:9200]},
publish_address {inet[/X.X.X.X:9200]}
[2013-05-07 07:37:51,838][INFO ][node ]
[ip-10-148-148-93.ec2.internal] {0.20.5}[2261]: started
[2013-05-07 07:37:51,905][DEBUG][action.admin.indices.status]
[ip-X.X.X.X.ec2.internal] [sdxcr][0], node[t5V-xrEnTp2FtCJxBJ5gPg], [P],
s[INITIALIZING]: Failed to execute
[org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@305f387c]
org.elasticsearch.indices.IndexMissingException: [sdxcr] missing
at
org.elasticsearch.indices.InternalIndicesService.indexServiceSafe(InternalIndicesService.java:244)
at
org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:152)
at
org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:59)
at
org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:234)
at
org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:211)
at
org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction$1.run(TransportBroadcastOperationAction.java:187)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:679)
[2013-05-07 07:37:51,930][DEBUG][action.admin.indices.status]
[ip-X.X.X.X.ec2.internal] [sdxcr][2], node[t5V-xrEnTp2FtCJxBJ5gPg], [P],
s[INITIALIZING]: Failed to execute
[org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@305f387c]
org.elasticsearch.indices.IndexMissingException: [sdxcr] missing
at
org.elasticsearch.indices.InternalIndicesService.indexServiceSafe(InternalIndicesService.java:244)
at
org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:152)
at
org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:59)
at
org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:234)
at
org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:211)
at
org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction$1.run(TransportBroadcastOperationAction.java:187)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:679)

...

eventually leading to repeated errors like these for the various shards:

[2013-05-07 07:37:52,958][DEBUG][action.admin.indices.status]
[ip-X.X.X.X.ec2.internal] [sdxcr][7], node[t5V-xrEnTp2FtCJxBJ5gPg], [P],
s[INITIALIZING]: Failed to execute
[org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@6446154e]
org.elasticsearch.index.IndexShardMissingException: [sdxcr][7] missing
at
org.elasticsearch.index.service.InternalIndexService.shardSafe(InternalIndexService.java:179)
at
org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:153)
at
org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:59)
at
org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:234)
at
org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:211)
at
org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction$1.run(TransportBroadcastOperationAction.java:187)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:679)
[2013-05-07 07:37:53,021][WARN ][indices.cluster ]
[ip-X.X.X.X.ec2.internal] [sdxcr][2] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException:
[sdxcr][2] failed recovery
at
org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:228)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:679)
Caused by: org.elasticsearch.index.engine.EngineCreationFailureException:
[sdxcr][2] failed to create engine
at
org.elasticsearch.index.engine.robin.RobinEngine.start(RobinEngine.java:252)
at
org.elasticsearch.index.shard.service.InternalIndexShard.start(InternalIndexShard.java:279)
at
org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:182)
at
org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:177)
... 3 more
Caused by: java.io.FileNotFoundException: _5d.fnm
at
org.elasticsearch.index.store.Store$StoreDirectory.openInput(Store.java:519)
at org.apache.lucene.index.FieldInfos.(FieldInfos.java:71)
at
org.apache.lucene.index.IndexWriter.getFieldInfos(IndexWriter.java:1212)
at
org.apache.lucene.index.IndexWriter.getCurrentFieldInfos(IndexWriter.java:1228)
at org.apache.lucene.index.IndexWriter.&lt;init&gt;(IndexWriter.java:1161)
at org.apache.lucene.index.XIndexWriter.&lt;init&gt;(XIndexWriter.java:17)
at
org.elasticsearch.index.engine.robin.RobinEngine.createWriter(RobinEngine.java:1365)
at
org.elasticsearch.index.engine.robin.RobinEngine.start(RobinEngine.java:250)
... 6 more

This just goes on and on: it fails to recover one shard, then moves on and
tries to recover a different one.
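For what it's worth, one quick way to bound when the failure actually started is to scan the log for the first WARN/ERROR timestamp. A minimal Python sketch, with the line format assumed from the excerpt above:

```python
import re

# Bracketed timestamp/level prefix used by the ES 0.20 log excerpt above,
# e.g. "[2013-05-07 07:37:53,021][WARN ][indices.cluster ]"
LINE_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} [\d:,]+)\]\[(\w+)\s*\]")

def first_problem(lines):
    """Return the timestamp of the first WARN/ERROR entry, or None."""
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group(2) in ("WARN", "ERROR"):
            return m.group(1)
    return None

sample = [
    "[2013-05-07 07:37:51,838][INFO ][node ] started",
    "[2013-05-07 07:37:53,021][WARN ][indices.cluster ] failed to start shard",
]
print(first_problem(sample))  # → 2013-05-07 07:37:53,021
```

Run against the full log file, this tells you whether the first problem really coincides with the morning's first search or silently predates it.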

I was curious if anyone else had experienced a similar cluster failure and
had any suggestions on how to prevent it in the future.

Thanks for any insights

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Curt Kohler wrote:

This morning I arrived to find our test ES cluster in poor
health...

[...]

I was curious if anyone else had experienced a similar cluster
failure and
had any suggestions on how to prevent it in the future.

These kinds of problems are symptomatic of running out of disk
space or possibly experiencing OutOfMemoryExceptions. Do you have
any monitoring on the cluster that can provide any more clues as
to what happened during the night?

Drew
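Even without a full monitoring stack, Drew's suggestion can be approximated cheaply. A minimal sketch (the data path is a placeholder; run it from cron and append the output to a file) that records disk headroom, which would catch an overnight disk squeeze:

```python
import shutil

def disk_headroom(path):
    """Return (free_gb, pct_free) for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / 1024**3, 100.0 * usage.free / usage.total

# "." stands in for the real ES data directory, which is an assumption here
free_gb, pct_free = disk_headroom(".")
print(f"{free_gb:.1f} GB free ({pct_free:.0f}%)")
```

The same cron job could also curl the cluster health endpoint and log the status, giving a timestamped trail for nights like this one.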


Unfortunately nothing beyond bigdesk/head, and nothing machine-level...
The cluster is just a "play around with it" instance for looking at the
Amazon Hi I/O instance types with Elasticsearch.

I saw a number of postings pointing to the number of open files, memory, or
disk space as potential sources of similar exceptions. I'm not sure this
problem falls under those categories, as it occurred at some point after we
ran our 4-hour load test (which had no problems and didn't run out of
memory). After that, the cluster would basically have been idle until I
came back in the morning. The machine still has over 1 TB of disk space
available under the data directories, so that is probably not the problem
either. I need to go back and check, but I'm pretty sure the file
descriptors are set to be unlimited for the process when we start it up.
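For the record, the effective descriptor limit can be checked without digging through the init scripts. A quick Python sketch (this shows the limits of the shell you run it from, not necessarily what the running ES process got; /proc/&lt;pid&gt;/limits is authoritative for that):

```python
import resource

# Soft/hard limits on open file descriptors for this process;
# RLIM_INFINITY is the "unlimited" value I expect to see.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

def fmt(limit):
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)

print(f"open files: soft={fmt(soft)} hard={fmt(hard)}")
```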

Thanks for the suggestions...

Curt

On May 7, 10:22 am, Drew Raines aarai...@gmail.com wrote:

[...]


Curt,

OpenJDK? Not real Sun/Oracle Java?

For reasons beyond my own control, I've run ES 0.20.4 and now 0.90.0 on Sun
Java 6, not 7. But it's real Java, not OpenJDK.

On a Solaris 3-node cluster:

-bash-3.00$ java -version
java version "1.6.0_18"
Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
Java HotSpot(TM) Client VM (build 16.0-b13, mixed mode, sharing)

And on my Mac:

$ java -version
java version "1.6.0_43"
Java(TM) SE Runtime Environment (build 1.6.0_43-b01-447-10M4203)
Java HotSpot(TM) 64-Bit Server VM (build 20.14-b01-447, mixed mode)

I haven't had any issues with Oracle's Java 6 (above).

On Tuesday, May 7, 2013 9:28:14 AM UTC-4, Curt Kohler wrote:

This morning I arrived to find our test ES cluster in poor health...

[...]


Curt,

It won't help you now, but it will in the future: like Drew said, put some
monitoring with historical trending on that ES cluster. Try SPM for ES
(see signature); we use it for ES and a bunch of other things.
Coincidentally, we're doing some ES stress testing now and using it to
understand what's happening. There is a little 'share' icon next to each
graph, so you can share any graphs you want people here to see.

Something went wrong at the Lucene level here, judging from that last
exception in your paste. Was that an indexing stress test or a search
stress test?

Otis

Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html

On Tuesday, May 7, 2013 10:22:49 AM UTC-4, Drew Raines wrote:

[...]
