I'm using Elasticsearch with Logstash, and each morning when the new daily index is created, its shards are all unassigned. I have to go in and manually assign them with:
curl -XPOST -d '{ "commands" : [ { "allocate" : { "index" : "logstash-2016.08.12", "shard" : 4, "node" : "data-node-1", "allow_primary":true } } ] }' http://localhost:9200/_cluster/reroute?pretty
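For reference, a quick way to see exactly which shards are stuck before rerouting (a sketch, assuming Elasticsearch is listening on the default localhost:9200):
curl -s 'http://localhost:9200/_cat/shards?v' | grep UNASSIGNED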
I have set the following, based on some Google research (a settings-API equivalent is sketched after the file listing below):
"routing":{"allocation":{"disable_allocation":"false"}}
This only started two days ago; it had been working fine until then, and I have not made any recent changes to elasticsearch.yml:
-rw-rw-r-- 1 logstash logstash 3323 Jun 22 13:37 elasticsearch.yml
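For reference, the equivalent using the current cluster.routing.allocation.enable setting (which replaced disable_allocation) would look roughly like this, assuming the default localhost:9200 endpoint:
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{ "transient" : { "cluster.routing.allocation.enable" : "all" } }'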
Each morning at 9 a.m. JST (00:00 UTC) a new Logstash index is created in Elasticsearch. At that point the new shards are UNASSIGNED, and they remain in that state until they are manually rerouted with the _cluster/reroute API. It can take anywhere from a few minutes up to an hour for things to settle down, during which time Kibana has no access to the data and Logstash starts erroring because Elasticsearch is essentially unavailable.
What is the output of _cluster/health during this period, when the shards are unassigned after the index is created?
Is there any information in the logs, and if so, can you share the relevant snippets with us? If your cluster is staying in the RED state (unassigned primary shards), there should be some relevant information in the logs to help diagnose the issue.
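Capturing the health output being asked about could look roughly like this (level=indices also shows which individual indices are red or yellow; assuming the default localhost:9200):
curl -s 'http://localhost:9200/_cluster/health?level=indices&pretty'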
In my case the issue appeared to be mainly down to performance. Even though mlockall was set to true, the node was still swapping; once I took care of that things improved, so verify that you have swapping disabled. Another issue was the limit on the number of open files the Elasticsearch user was allowed.
Sorry I'm not more help, but this resolved most of my unassigned-shard issues; some quick checks are sketched below. The other piece will be adjusting the index template and reducing the number of shards.
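Roughly, the checks I mean (a sketch, assuming the default localhost:9200 and a Linux host):
# memory locking should report "mlockall" : true
curl -s 'http://localhost:9200/_nodes/process?pretty'
# open/max file descriptors the process is actually running with
curl -s 'http://localhost:9200/_nodes/stats/process?pretty' | grep file_descriptors
# swap should be empty or disabled at the OS level
swapon -s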
Unfortunately I recovered the cluster manually before checking for messages here, so I don't have the _cluster/health output from while it was in the erroring state.
Here's one of the errors I can see from this morning's failure:
[2016-08-31 09:30:03,192][WARN ][indices.cluster ] [stg-agselastic101z.stg.jp.local] [[filebeat-2016.08.31][2]] marking and sending shard failed due to [failed recovery]
RecoveryFailedException[[filebeat-2016.08.31][2]: Recovery failed from {stg-agselastic102z.stg.jp.local}{rsG97zspQ36d9qiSuLmGKg}{100.73.12.105}{100.73.12.105:9300}{max_local_storage_nodes=1, master=true} into {stg-agselastic101z.stg.jp.local}{r5iTS41XRAalRz_BAuYqaA}{100.73.11.105}{100.73.11.105:9300}{max_local_storage_nodes=1, master=true}]; nested: RemoteTransportException[[stg-agselastic102z.stg.jp.local][100.73.12.105:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [1] files with total size of [130b]]; nested: ReceiveTimeoutTransportException[[stg-agselastic101z.stg.jp.local][100.73.11.105:9300][internal:index/shard/recovery/filesInfo] request_id [1568794] timed out after [900000ms]];
at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:258)
at org.elasticsearch.indices.recovery.RecoveryTarget.access$1100(RecoveryTarget.java:69)
at org.elasticsearch.indices.recovery.RecoveryTarget$RecoveryRunner.doRun(RecoveryTarget.java:508)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: RemoteTransportException[[stg-agselastic102z.stg.jp.local][100.73.12.105:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [1] files with total size of [130b]]; nested: ReceiveTimeoutTransportException[[stg-agselastic101z.stg.jp.local][100.73.11.105:9300][internal:index/shard/recovery/filesInfo] request_id [1568794] timed out after [900000ms]];
Caused by: [filebeat-2016.08.31][[filebeat-2016.08.31][2]] RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [1] files with total size of [130b]]; nested: ReceiveTimeoutTransportException[[stg-agselastic101z.stg.jp.local][100.73.11.105:9300][internal:index/shard/recovery/filesInfo] request_id [1568794] timed out after [900000ms]];
at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:135)
at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:126)
at org.elasticsearch.indices.recovery.RecoverySource.access$200(RecoverySource.java:52)
at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:135)
at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:132)
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: [filebeat-2016.08.31][[filebeat-2016.08.31][2]] RecoverFilesRecoveryException[Failed to transfer [1] files with total size of [130b]]; nested: ReceiveTimeoutTransportException[[stg-agselastic101z.stg.jp.local][100.73.11.105:9300][internal:index/shard/recovery/filesInfo] request_id [1568794] timed out after [900000ms]];
at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:453)
at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:133)
... 11 more
Caused by: ReceiveTimeoutTransportException[[stg-agselastic101z.stg.jp.local][100.73.11.105:9300][internal:index/shard/recovery/filesInfo] request_id [1568794] timed out after [900000ms]]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:679)
... 3 more
Did these errors happen when the filebeat-2016.08.31 index was created? It looks like a cluster restart is happening and the cluster is trying to recover a shard of filebeat-2016.08.31, but the recovery timed out because the recovery source couldn't reach the target node [stg-agselastic101z.stg.jp.local][100.73.11.105:9300].
Any chance there were network connectivity issues?
Are these logs from your master node? It would help to know the names of the three nodes in your cluster and to see the relevant log snippets for this error from all three; the sketch below is one quick way to list them.
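A sketch, assuming the default port (the master column marks the elected master with *):
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,ip,master'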
I'll monitor the connection during my morning and report logs from all three nodes, so logs to come in a follow-up post.
These errors happen at exactly the time the filebeat-2016.08.31 index is created. Network connectivity should not be an issue; the nodes are all within the same datacenter, on the same network.
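When I watch it tomorrow I'll also keep an eye on the in-flight recoveries as the index rolls over, along these lines (a sketch, assuming the default localhost:9200; grep -v done hides completed recoveries):
curl -s 'http://localhost:9200/_cat/recovery?v' | grep -v done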