Hi -- I am having significant problems with my small ES cluster. I'll accept that the problems are likely with my setup and not ES itself, but I am on the verge of giving up.
One problem I am seeing is that every time I restart a node, or ES on a node crashes and is restarted, shard recovery takes days -- if it ever completes. Several times, when recovery never finished, I have had to wipe all of the data on the cluster and backfill it. The shards that fail to recover tend to be large: over 30GB. The error currently in my logs suggests there is a 900000ms recovery timeout that prevents these shards from being transferred to the recovering node. (See the error message below.)
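For anyone who asks how I am tracking this: I have just been polling the cat recovery API, nothing more sophisticated than:

# list recovering shards and how many files/bytes have been copied so far
curl -XGET 'localhost:9200/_cat/recovery?v'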
My cluster is composed of two data nodes (96GB RAM, 24 cores) and three master nodes (no data, 64GB RAM, 16 cores) which are also used for Logstash indexing. I am running ES 2.0. Each ES instance has a 30GB heap (reduced from something much larger after reading that the JVM does not handle heaps above ~32GB gracefully, since it loses compressed object pointers). Each index is usually configured with one shard and one replica.
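Concretely, the one-shard/one-replica setup comes from an index template along these lines (the template name here is just illustrative), and the heap is set with ES_HEAP_SIZE=30g:

# illustrative template name; the point is one primary shard and one replica per index
curl -XPUT 'localhost:9200/_template/logstash_defaults' -d '
{
  "template": "*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}'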
- what is the 900s (15-minute) timeout I am hitting?
- is it configurable, or should it be left alone while I change something else about my cluster? (my guess at the setting involved is sketched below these questions)
- are indexes/shards of this size (30GB+, and the ~95GB one in the error below) unusual?
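For what it is worth, my best guess is that the 900000ms corresponds to a 15-minute indices.recovery.internal_action_timeout default, and that I could raise it (along with the recovery throttle, indices.recovery.max_bytes_per_sec, 40mb/s by default if I am reading the docs right) with a transient cluster settings update like the one below -- but I would rather understand whether that is the right knob than just paper over a deeper problem.

# my guesses at the relevant settings; the values are placeholders, not recommendations
curl -XPUT 'localhost:9200/_cluster/settings' -d '
{
  "transient": {
    "indices.recovery.internal_action_timeout": "30m",
    "indices.recovery.max_bytes_per_sec": "100mb"
  }
}'

Here is the full error from the logs: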
Caused by: RemoteTransportException[[muninn][132.246.195.227:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [190] files with total size of [95gb]]; nested: ReceiveTimeoutTransportException[[huginn][132.246.195.226:9300][internal:index/shard/recovery/prepare_translog] request_id [23511214] timed out after [900000ms]];
Caused by: [tomcat-svc-2015.02][[tomcat-svc-2015.02][0]] RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [190] files with total size of [95gb]]; nested: ReceiveTimeoutTransportException[[huginn][132.246.195.226:9300][internal:index/shard/recovery/prepare_translog] request_id [23511214] timed out after [900000ms]];
at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:135)
at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:127)
at org.elasticsearch.indices.recovery.RecoverySource.access$200(RecoverySource.java:53)
at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:136)
at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:133)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:299)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)