Too many unassigned replica shards every day


(Swetha) #1

Hello,

We have an ELK cluster with 35 nodes (3 masters and 32 data nodes).
New indices are created once every day; approximately 280 indices are created daily.
Each index has 24 shards with a replication factor of 1.
Each node has 15TB of disk and 125GB of RAM.
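For context, the daily shard churn implied by those numbers can be computed directly (a quick sketch; the figures come from the description above):

```shell
# Shards created per day: 280 indices, 24 primaries each, replication factor 1
# means 2 copies of every shard.
indices_per_day=280
shards_per_index=24
copies=2  # 1 primary + 1 replica
echo "new shards per day: $(( indices_per_day * shards_per_index * copies ))"
# prints "new shards per day: 13440"
# With ~97k active and unassigned shards spread over 32 data nodes,
# that is roughly 3,000 shards per data node.
```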

The problem is that every day, after the indices are created, the cluster goes into yellow state with too many unassigned shards and takes a long time to recover. The following is the state of the cluster 4.5 hours after the new indices were created.

{
"cluster_name" : <cluster_name>,
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 35,
"number_of_data_nodes" : 32,
"active_primary_shards" : 48589,
"active_shards" : 89434,
"relocating_shards" : 0,
"initializing_shards" : 30,
"unassigned_shards" : 7760,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 1936,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 2036819,
"active_shards_percent_as_number" : 91.98757508434132
}
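For reference, output like the above comes from the cluster health API (assuming the default HTTP port; adjust the host to match your cluster):

```shell
# Cluster-level health summary: shard counts, pending task depth, overall status
curl -s 'http://localhost:9200/_cluster/health?pretty'
```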

The reasons for shards being unassigned are the following:

INDEX_CREATED - on newly created indices
NODE_LEFT - on older indices

The following is the output from the _cluster/allocation/explain API:

...
{
"node_id" : "XZAy1ITXSjmO_a-8yulfig",
"node_name" : "node1",
"transport_address" : ":9300",
"node_attributes" : {
"ml.max_open_jobs" : "10",
"ml.enabled" : "true"
},
"node_decision" : "throttled",
"deciders" : [
{
"decider" : "throttling",
"decision" : "THROTTLE",
"explanation" : "reached the limit of incoming shard recoveries [10], cluster setting [cluster.routing.allocation.node_concurrent_incoming_recoveries=10] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
}
]
},
{
"node_id" : "dnB6TExcSOaI_poL3d3t9g",
"node_name" : "node2",
"transport_address" : ":9300",
"node_attributes" : {
"ml.max_open_jobs" : "10",
"ml.enabled" : "true"
},
"node_decision" : "throttled",
"store" : {
"matching_sync_id" : true
},
"deciders" : [
{
"decider" : "throttling",
"decision" : "THROTTLE",
"explanation" : "reached the limit of incoming shard recoveries [10], cluster setting [cluster.routing.allocation.node_concurrent_incoming_recoveries=10] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
}
]
},
{
"node_id" : "BrIKoXg4TkuJ2Ekn8YKuPQ",
"node_name" : "node3",
"transport_address" : ":9300",
"node_attributes" : {
"ml.max_open_jobs" : "10",
"ml.enabled" : "true"
},
"node_decision" : "no",
"store" : {
"matching_sync_id" : true
},
"deciders" : [
{
"decider" : "same_shard",
"decision" : "NO",
"explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[logstash-hadoop_axonitered-hdfsproxy-audit-s3-2017.12.15][18], node[BrIKoXg4TkuJ2Ekn8YKuPQ], [P], s[STARTED], a[id=OzSdAIySQFWKj7PuzUHbjQ]]"
}
]
}
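The THROTTLE decisions above come from the per-node recovery concurrency limit named in the explanation. If the disks and network have headroom, that limit can be raised at runtime; the value 20 below is purely illustrative, not a recommendation:

```shell
# Raise the per-node cap on concurrent shard recoveries.
# Transient settings are lost on a full cluster restart; 20 is an
# illustrative value to be tuned against disk and network headroom.
curl -s -H 'Content-Type: application/json' -X PUT \
  'http://localhost:9200/_cluster/settings' -d '
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 20
  }
}'
```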

We don't want the cluster to go into yellow state every day, with thousands of shards unassigned and a long recovery. We want index creation and the distribution of both primary and replica shards to be smooth. Would increasing the number of shards per index help? If yes, by how much?

Also, what does NODE_LEFT mean? I do not see any services being restarted on the nodes.


(Mark Walkom) #2

That is way too many shards. You need to reduce that number quite dramatically.


(andy_zhou) #3

How many shards are configured per index? You can increase cluster.routing.allocation.node_initial_primaries_recoveries to a larger value.
Also output the pending tasks to see what is pending.
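Assuming a default setup, those two suggestions map to calls like these (the value 8 below is illustrative):

```shell
# See what the master node is queueing
curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'

# Allow more primary shards per node to recover concurrently after a restart
# (transient setting; 8 is an illustrative value, the default is 4)
curl -s -H 'Content-Type: application/json' -X PUT \
  'http://localhost:9200/_cluster/settings' -d '
{
  "transient": {
    "cluster.routing.allocation.node_initial_primaries_recoveries": 8
  }
}'
```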


(Swetha) #4

The following is a subset of the output from the pending tasks:

{
"insert_order" : 180619,
"priority" : "URGENT",
"source" : "shard-started shard id [[logstash-hadoop_radiumtan-oozie-oozie-ops-s3-2017.12.17][5]], allocation id [2k_o8_i9RtO3zF6AaG_6yA], primary term [0], message [after peer recovery]",
"executing" : false,
"time_in_queue_millis" : 12994,
"time_in_queue" : "12.9s"
},
{
"insert_order" : 179771,
"priority" : "HIGH",
"source" : "shard-failed",
"executing" : false,
"time_in_queue_millis" : 438956,
"time_in_queue" : "7.3m"
},
{
"insert_order" : 179758,
"priority" : "HIGH",
"source" : "shard-failed",
"executing" : false,
"time_in_queue_millis" : 461314,
"time_in_queue" : "7.6m"
},
{
"insert_order" : 179772,
"priority" : "HIGH",
"source" : "shard-failed",
"executing" : false,
"time_in_queue_millis" : 438956,
"time_in_queue" : "7.3m"
},
{
"insert_order" : 179757,
"priority" : "HIGH",
"source" : "shard-failed",
"executing" : false,
"time_in_queue_millis" : 461485,
"time_in_queue" : "7.6m"
},
{
"insert_order" : 179751,
"priority" : "HIGH",
"source" : "shard-failed",
"executing" : false,
"time_in_queue_millis" : 462079,
"time_in_queue" : "7.7m"
},
{
"insert_order" : 179763,
"priority" : "HIGH",
"source" : "shard-failed",
"executing" : false,
"time_in_queue_millis" : 460810,
"time_in_queue" : "7.6m"
},
{
"insert_order" : 179765,
"priority" : "HIGH",
"source" : "shard-failed",
"executing" : false,
"time_in_queue_millis" : 458949,
"time_in_queue" : "7.6m"
},
{
"insert_order" : 179796,
"priority" : "HIGH",
"source" : "shard-failed",
"executing" : false,
"time_in_queue_millis" : 437708,
"time_in_queue" : "7.2m"
},


(Mark Walkom) #5

You have too many shards.


(Swetha) #6

Does the number of shards per index need to be reduced? @warkolm


(Christian Dahlqvist) #7

I suspect you need to reduce the number of indices as well as the number of shards. Please read this blog post about shards and sharding practices for some guidelines.
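One way to act on that advice is an index template that gives new daily indices fewer shards. The pattern, template name, and shard count below are illustrative; note that on Elasticsearch 5.x and earlier the `index_patterns` key is spelled `template` instead:

```shell
# Illustrative template: new logstash-* indices get 3 primaries instead of 24.
# Pick the shard count from target shard size, not a fixed number.
curl -s -H 'Content-Type: application/json' -X PUT \
  'http://localhost:9200/_template/daily-logs' -d '
{
  "index_patterns": ["logstash-*"],
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}'
```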


(Alex Zuñiga) #8

Excuse me for interrupting, I have a similar problem.

http://localhost:9200/_cat/shards

censos2 0 p STARTED 6608080 199.7mb 192.168.0.104 nodo2
censos2 0 r UNASSIGNED
.kibana 0 p STARTED 2 6.1kb 192.168.0.104 nodo2
.kibana 0 r UNASSIGNED
.marvel-es-data-1 0 p STARTED 4 4.5kb 192.168.0.104 nodo2
.marvel-es-data-1 0 r UNASSIGNED
.marvel-es-1-2017.12.21 0 p STARTED 95 147.5kb 192.168.0.104 nodo2
.marvel-es-1-2017.12.21 0 r UNASSIGNED
censos 3 p STARTED 26732520 760.9mb 192.168.0.105 nodo3
censos 3 r STARTED 26732520 760.9mb 192.168.0.104 nodo2
censos 4 p STARTED 26678994 717.3mb 192.168.0.105 nodo3
censos 4 r STARTED 26678994 717.3mb 192.168.0.104 nodo2
censos 1 p STARTED 26722913 688.5mb 192.168.0.105 nodo3
censos 1 r STARTED 26722913 688.5mb 192.168.0.104 nodo2
censos 2 p STARTED 26720830 701.6mb 192.168.0.105 nodo3
censos 2 r STARTED 26720830 701.6mb 192.168.0.104 nodo2
censos 0 p STARTED 26689213 695.7mb 192.168.0.105 nodo3
censos 0 r STARTED 26689213 695.7mb 192.168.0.104 nodo2

http://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason

censos2 0 p STARTED
censos2 0 r UNASSIGNED CLUSTER_RECOVERED
.kibana 0 p STARTED
.kibana 0 r UNASSIGNED CLUSTER_RECOVERED
.marvel-es-data-1 0 p STARTED
.marvel-es-data-1 0 r UNASSIGNED INDEX_CREATED
.marvel-es-1-2017.12.21 0 p STARTED
.marvel-es-1-2017.12.21 0 r UNASSIGNED INDEX_CREATED
censos 3 p STARTED
censos 3 r STARTED
censos 4 p STARTED
censos 4 r STARTED
censos 1 p STARTED
censos 1 r STARTED
censos 2 p STARTED
censos 2 r STARTED
censos 0 p STARTED
censos 0 r STARTED

This did not happen before; it started a couple of weeks ago. Sometimes I fix it by first disabling shard allocation, then restarting the cluster, and after that re-enabling shard allocation.

But every day, when the Marvel indices are automatically created again, the same thing happens, and it is tedious to repeat this process every day, sometimes several times a day. Does anyone know a permanent solution so this does not happen again?
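For reference, the disable/re-enable allocation cycle described above corresponds to the following calls; this works around the symptom rather than fixing the cause:

```shell
# Stop shard allocation before restarting the cluster...
curl -s -H 'Content-Type: application/json' -X PUT \
  'http://localhost:9200/_cluster/settings' -d '
{ "transient": { "cluster.routing.allocation.enable": "none" } }'

# ...then re-enable it once the nodes are back
curl -s -H 'Content-Type: application/json' -X PUT \
  'http://localhost:9200/_cluster/settings' -d '
{ "transient": { "cluster.routing.allocation.enable": "all" } }'
```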


(Mark Walkom) #9

It'd be better if you created a new thread please :slight_smile:


(Alex Zuñiga) #10

I will try that. Thanks.


(Swetha) #11

We reduced the number of indices and the number of shards per index.

{
"cluster_name": "xxx",
"status": "green",
"timed_out": false,
"number_of_nodes": 35,
"number_of_data_nodes": 32,
"active_primary_shards": 4184,
"active_shards": 8414,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 0,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 0,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 100
}

The cluster is very stable since then.


(system) #12

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.