Primary shard not available

Hi

I am facing issues because the primary shard is not available. What could be the reason for this? Please suggest.

Caused by: org.elasticsearch.action.UnavailableShardsException: [.monitoring-kibana-6-2018.12.20][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-kibana-6-2018.12.20][0]] containing [index {[.monitoring-kibana-6-2018.12.20][doc][KbugymcBNIueI7r1ErSY], source[{"cluster_uuid":"lqoa3HJhRPyUAQUkLgFeqw","timestamp":"2018-12-20T07:59:07.413Z","interval_ms":10000,"type":"kibana_stats","source_node":{"uuid":"CiXHakymT_6VN-KMuIKsfQ","host":"myhost","transport_address":"10.240.0.14:9401","ip":"10.240.0.14","name":"elkkibana01.pod","timestamp":"2018-12-20T07:59:07.413Z"},"kibana_stats":{"kibana":{"uuid":"1a54c2b0-78e8-4d76-9dc3-6dd93e9a8f67","name":"myhost","index":".kibana","host":"0","transport_address":"0:9018","version":"6.4.2","snapshot":false,"status":"green"},"usage":{"xpack":{"reporting":{"available":true,"enabled":true,"browser_type":"phantom","_all":0,"csv":{"available":true,"total":0},"printable_pdf":{"available":false,"total":0},"status":{},"lastDay":{"_all":0,"csv":{"available":true,"total":0},"printable_pdf":{"available":false,"total":0},"status":{}},"last7Days":{"_all":0,"csv":{"available":true,"total":0},"printable_pdf":{"available":false,"total":0},"status":{}}}}}}}]}]]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryBecauseUnavailable(TransportReplicationAction.java:927) ~[elasticsearch-6.4.2.jar:6.4.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryIfUnavailable(TransportReplicationAction.java:773) ~[elasticsearch-6.4.2.jar:6.4.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:726) ~[elasticsearch-6.4.2.jar:6.4.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.4.2.jar:6.4.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onTimeout(TransportReplicationAction.java:887) ~[elasticsearch-6.4.2.jar:6.4.2]
at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:317) ~[elasticsearch-6.4.2.jar:6.4.2]
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:244) ~[elasticsearch-6.4.2.jar:6.4.2]
at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:573) ~[elasticsearch-6.4.2.jar:6.4.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) ~[elasticsearch-6.4.2.jar:6.4.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_181]
... 1 more

Regards,

This can be answered with the allocation explain API:

GET /_cluster/allocation/explain
{
  "index": ".monitoring-kibana-6-2018.12.20",
  "shard": 0,
  "primary": true
}

If the result of that call is unclear, please share it here for further help.


Thank you for the reply. The explanation I got is as follows:

"reached the limit of ongoing initial primary recoveries [4], cluster setting [cluster.routing.allocation.node_initial_primaries_recoveries=4]"

I am considering increasing the recovery limit. What would be the repercussions of doing so?
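For reference, I assume the change would be made with the cluster settings API, along these lines (the value 8 here is just an example, not a recommendation):

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_initial_primaries_recoveries": 8
  }
}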

Running too many recoveries in parallel can result in some or all of them timing out and failing, and can consume too many cluster resources such as bandwidth.

However, primary recoveries are normally quite quick, so I'm a bit surprised that your cluster is stuck here. I'm curious about what other recoveries are taking place preventing this one. Can you share the output of GET _cat/recovery?
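If the full output is noisy, a narrower view such as the following may be easier to read (the column selection here is just a suggestion):

GET _cat/recovery?v&h=index,shard,time,type,stage,source_node,target_node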

https://dpaste.de/MPT6 is the output.

I do not see any incomplete recoveries in that output.

However, you seem to have far too many shards for the amount of data you are dealing with. You have multiple daily indices, each with 15 shards, with many shards smaller than 1MB in size and no shards larger than 400MB. This will certainly have an impact on your cluster performance. This article gives advice on sharding, but the main point is that you should aim for shards of around 40GB in size. I think you could reasonably reduce the number_of_shards parameter to 1 on all these indices, and extend some of the daily indices to be weekly or monthly instead.
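For example, an index template along these lines would apply that setting to newly created indices; the template name and pattern are only illustrative, so adjust them to match your own index naming:

PUT /_template/single-shard-example
{
  "index_patterns": ["my-daily-index-*"],
  "settings": {
    "index.number_of_shards": 1
  }
}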


I am working on your advice about the shards. However, my primary question is: if there are no incomplete recoveries, why is there an exception about the primary shard not being available? I reproduced the problem today, so I have posted the details.

Ok, a bit more context would help! It wasn't clear that there was any issue, and it remains unclear whether this is still the same issue.

Which shard is reported as unavailable? What does the allocation explain API say about it?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.