All new indexes created have unassigned shards

I'm using Elasticsearch for Logstash, and each morning when the new index is created, its shards are all unassigned. I have to go in and manually assign them with:
curl -XPOST -d '{ "commands" : [ { "allocate" : { "index" : "logstash-2016.08.12", "shard" : 4, "node" : "data-node-1", "allow_primary":true } } ] }' http://localhost:9200/_cluster/reroute?pretty
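
In practice I first list which shards are stuck and then loop the reroute over them. A rough sketch of that, assuming the cluster answers on localhost:9200 and that data-node-1 (from the command above) is a valid target node:

# List only the shards that are currently unassigned
curl -s 'http://localhost:9200/_cat/shards?v' | grep UNASSIGNED

# Loop over them and issue a reroute for each one. allow_primary is risky:
# it can discard data if a newer copy of the shard exists on another node.
curl -s 'http://localhost:9200/_cat/shards' | awk '$4 == "UNASSIGNED" {print $1, $2}' |
while read index shard; do
  curl -s -XPOST 'http://localhost:9200/_cluster/reroute?pretty' -d "{
    \"commands\": [ { \"allocate\": {
      \"index\": \"$index\", \"shard\": $shard,
      \"node\": \"data-node-1\", \"allow_primary\": true
    } } ]
  }"
done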

I have set this based on some Google research:
"routing":{"allocation":{"disable_allocation":"false"}

You shouldn't need to manually reroute shards; something is misconfigured in your cluster.

Can you post the output of GET /_cluster/settings? Also, have you made any changes to your elasticsearch.yml?

This just started two days ago; it had been working fine. I have not made any changes recently to elasticsearch.yml:
-rw-rw-r-- 1 logstash logstash 3323 Jun 22 13:37 elasticsearch.yml

/_cluster/settings?pretty=true
{
  "persistent" : { },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "enable" : "all"
        }
      }
    }
  }
}

Hm, is there anything in your logs about low disk space, or warnings in general?

Could you also paste your templates? GET /_template/

I'm having the same problem, so I thought I would throw in my configs etc. here too.

Similar to Unassigned shards, v2 (Unanswered).

Each morning at 9am JST (which is midnight UTC), a new Logstash index is created in Elasticsearch. At that time, the new shards are UNASSIGNED. They remain in this state until manually rerouted using the _cluster/reroute API. This can take a few minutes, or anywhere up to an hour, to settle down, during which time Kibana has no access to the data and Logstash starts erroring because Elasticsearch is basically unavailable.

My _cluster/settings output:
{
  "persistent": { },
  "transient": {
    "cluster": {
      "routing": {
        "allocation": {
          "enable": "all"
        }
      }
    }
  }
}

My elasticsearch.yml on all 3 nodes is the same, and basically unchanged:

cluster.name: prjsearch
node.name: stg-agselastic101z.stg.jp.local
node.max_local_storage_nodes: 1
path.conf: /etc/elasticsearch
path.data: /data/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 0.0.0.0
http.port: 9200
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.73.11.105:9300", "10.73.12.105:9300", "10.73.13.105:9300"]
discovery.zen.minimum_master_nodes: 2
gateway.expected_nodes: 0
http.cors.allow-origin: "*"
http.cors.enabled: true
network.publish_address: 10.73.11.105
node.data: true
node.master: true

Any help is appreciated. It's frustrating to have to do manual operations on this cluster every morning.

Which version of Elasticsearch are you running?

And what is the output of /_cluster/health during this time when shards are unassigned after creating the index?

Is there any information in the logs, and if so can you share the relevant snippets with us? If your cluster is staying in the RED state (unassigned primary shards), then there should be some relevant info in the logs to help diagnose the issue.
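
If it happens again, something along these lines (run against any node; localhost:9200 is just an assumption) would capture the state we need to see while the shards are unassigned:

# Overall cluster state: status, counts of unassigned shards, and any pending cluster tasks
curl -s 'http://localhost:9200/_cluster/health?pretty'
curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'

# Which shards are not started, and whether any recoveries are still in flight
curl -s 'http://localhost:9200/_cat/shards?v' | grep -v STARTED
curl -s 'http://localhost:9200/_recovery?active_only=true&pretty'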

The issue for me appeared to be mainly due to performance problems. Even though mlockall was set to true, the node was still swapping; I took care of that, so verify you have swapping disabled. Another issue was the open-file limit for the user running Elasticsearch.
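
For reference, this is roughly what I checked and changed on each node. The elasticsearch user name, the limit value, and localhost:9200 are assumptions from my setup; adjust for yours.

# Confirm the heap is actually locked (mlockall) on every node
curl -s 'http://localhost:9200/_nodes?filter_path=**.mlockall&pretty'

# Disable swap outright, or at least make the kernel very reluctant to use it
sudo swapoff -a
sudo sysctl -w vm.swappiness=1

# Raise the open-file limit for the user running Elasticsearch, then restart the service
echo 'elasticsearch - nofile 65536' | sudo tee -a /etc/security/limits.conf

# After the restart, verify the new limit was picked up
curl -s 'http://localhost:9200/_nodes/stats/process?filter_path=**.max_file_descriptors&pretty'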

Sorry I'm not more help, but this resolved a lot of my unassigned-shards issues. The other piece will be adjusting the template and reducing the number of shards.
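
For the template piece, what I have in mind is something along these lines. The template name, match pattern, and shard counts are just an example, and it only affects indices created after the template is in place:

# Override the default shard count for future daily logstash indices
curl -XPUT 'http://localhost:9200/_template/logstash_shards?pretty' -d '{
  "template": "logstash-*",
  "order": 1,
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}'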

Elasticsearch v2.3.3

Unfortunately I recovered the cluster manually before checking for messages here, so I don't have the _cluster/health output from while it was in the erroring state.

Here's one of the errors I can see from this morning's failure:

[2016-08-31 09:30:03,192][WARN ][indices.cluster          ] [stg-agselastic101z.stg.jp.local] [[filebeat-2016.08.31][2]] marking and sending shard failed due to [failed recovery]
RecoveryFailedException[[filebeat-2016.08.31][2]: Recovery failed from {stg-agselastic102z.stg.jp.local}{rsG97zspQ36d9qiSuLmGKg}{100.73.12.105}{100.73.12.105:9300}{max_local_storage_nodes=1, master=true} into {stg-agselastic101z.stg.jp.local}{r5iTS41XRAalRz_BAuYqaA}{100.73.11.105}{100.73.11.105:9300}{max_local_storage_nodes=1, master=true}]; nested: RemoteTransportException[[stg-agselastic102z.stg.jp.local][100.73.12.105:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [1] files with total size of [130b]]; nested: ReceiveTimeoutTransportException[[stg-agselastic101z.stg.jp.local][100.73.11.105:9300][internal:index/shard/recovery/filesInfo] request_id [1568794] timed out after [900000ms]];
	at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:258)
	at org.elasticsearch.indices.recovery.RecoveryTarget.access$1100(RecoveryTarget.java:69)
	at org.elasticsearch.indices.recovery.RecoveryTarget$RecoveryRunner.doRun(RecoveryTarget.java:508)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: RemoteTransportException[[stg-agselastic102z.stg.jp.local][100.73.12.105:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [1] files with total size of [130b]]; nested: ReceiveTimeoutTransportException[[stg-agselastic101z.stg.jp.local][100.73.11.105:9300][internal:index/shard/recovery/filesInfo] request_id [1568794] timed out after [900000ms]];
Caused by: [filebeat-2016.08.31][[filebeat-2016.08.31][2]] RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [1] files with total size of [130b]]; nested: ReceiveTimeoutTransportException[[stg-agselastic101z.stg.jp.local][100.73.11.105:9300][internal:index/shard/recovery/filesInfo] request_id [1568794] timed out after [900000ms]];
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:135)
	at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:126)
	at org.elasticsearch.indices.recovery.RecoverySource.access$200(RecoverySource.java:52)
	at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:135)
	at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:132)
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
	at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: [filebeat-2016.08.31][[filebeat-2016.08.31][2]] RecoverFilesRecoveryException[Failed to transfer [1] files with total size of [130b]]; nested: ReceiveTimeoutTransportException[[stg-agselastic101z.stg.jp.local][100.73.11.105:9300][internal:index/shard/recovery/filesInfo] request_id [1568794] timed out after [900000ms]];
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:453)
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:133)
	... 11 more
Caused by: ReceiveTimeoutTransportException[[stg-agselastic101z.stg.jp.local][100.73.11.105:9300][internal:index/shard/recovery/filesInfo] request_id [1568794] timed out after [900000ms]]
	at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:679)
	... 3 more

Did these errors happen when the filebeat-2016.08.31 index was created? It seems a cluster restart was happening and the cluster was trying to recover a shard of filebeat-2016.08.31, but the recovery timed out because the source of the recovery couldn't connect to the target node stg-agselastic101z.stg.jp.local [100.73.11.105:9300].

Any chance there were network connectivity issues?
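
A quick sanity check from each node would at least rule out basic reachability on the transport port; the IPs below are taken from your elasticsearch.yml, so adjust if they differ:

# From each node, confirm the other nodes' transport port (9300) is reachable
for host in 10.73.11.105 10.73.12.105 10.73.13.105; do
  nc -z -w 5 "$host" 9300 && echo "$host:9300 ok" || echo "$host:9300 UNREACHABLE"
done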

Are these logs from your master? It would help to know what the names of the 3 nodes are in your cluster, and see the relevant logs for this error message from all 3 nodes.

I'll monitor the connection tomorrow morning (my time) and report logs from all 3 nodes. So, logs to come in a follow-up post.

These errors happen at exactly the time the filebeat-2016.08.31 index is created. Network connectivity should not be an issue; these nodes are all within the same datacenter, on the same network.