Elasticsearch is not reallocating shards after the primary shards were recovered. Unable to perform bulk indexing

We have a 5-node cluster that is not reallocating shards after the primary shards were recovered. To address this we increased the heap size from 80GB (on 160GB-RAM servers) to 192GB (on 384GB-RAM servers), partly because around 1000 jobs in the queue were slowing down the recovery. We initially had a 3-node cluster but added 2 more data nodes, as we have two indexes (650GB) with 14 shards; all shard sizes are below 50GB. Now all but a few replicas remain unallocated and bulk indexing is rejected by the cluster. This is affecting the application in production, and so far we haven't found a solution.

The issue started after we upgraded ES from 7.5.1 to 7.6.1. During this upgrade we had changed thread_pool.search.max_queue_size from 1000 to 2500. The cluster was green after the upgrade but it was not allowing bulk indexing. That's when we restarted the cluster, and it never turned green again. We changed the replica count to 0 and then back to 2, but it's the same: allocation stops after some time.
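For reference, the replica toggle described above can be issued through the index settings API; a sketch using the index name from the allocation output further down:

```
PUT /reindexed-v6-mexplorer_entity3/_settings
{
  "index": { "number_of_replicas": 0 }
}

PUT /reindexed-v6-mexplorer_entity3/_settings
{
  "index": { "number_of_replicas": 2 }
}
```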

thread_pool.search.max_queue_size has since been changed back to 1000.
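For anyone retracing this: thread_pool.search.max_queue_size is a static node-level setting, so reverting it means editing elasticsearch.yml on each node and restarting; a sketch:

```
# elasticsearch.yml (static setting; takes effect only after a node restart)
thread_pool.search.max_queue_size: 1000
```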

The cluster setting is as follows:

_cluster/health
{
  "cluster_name": "xxxcluster",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 5,
  "number_of_data_nodes": 5,
  "active_primary_shards": 362,
  "active_shards": 517,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 552,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 96,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 12906848,
  "active_shards_percent_as_number": 48.36295603367633
}

=========================================

_cluster/settings
{
  "persistent": {
    "cluster": {
      "routing": {
        "rebalance": {
          "enable": "all"
        },
        "allocation": {
          "cluster_concurrent_rebalance": "5",
          "node_concurrent_recoveries": "5",
          "enable": "all"
        }
      }
    },
    "indices": {
      "recovery": {
        "max_bytes_per_sec": "200mb"
      }
    },
    "xpack": {
      "monitoring": {
        "collection": {
          "enabled": "true"
        }
      }
    }
  },
  "transient": { }
}
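The persistent settings above would have been applied with something like the following (a sketch using the flattened key form):

```
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": "5",
    "cluster.routing.allocation.node_concurrent_recoveries": "5",
    "indices.recovery.max_bytes_per_sec": "200mb"
  }
}
```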

=====================================

_cluster/allocation/explain?pretty
{
"index": "reindexed-v6-mexplorer_entity3",
"shard": 2,
"primary": false,
"current_state": "unassigned",
"unassigned_info": {
"reason": "REPLICA_ADDED",
"at": "2020-03-17T10:34:30.540Z",
"last_allocation_status": "no_attempt"
},
"can_allocate": "yes",
"allocate_explanation": "can allocate the shard",
"target_node": {
"id": "cx9cWDAlTWKVjnYrbq8J7w",
"name": "prod-es01",
"transport_address": "192.168.xx.01:9300",
"attributes": {
"ml.machine_memory": "169012310016",
"ml.max_open_jobs": "20",
"xpack.installed": "true"
}
},
"node_allocation_decisions": [
{
"node_id": "ZpI06DeQRluhoKp-jr0FoQ",
"node_name": "prod-es05",
"transport_address": "192.168.xx.05:9300",
"node_attributes": {
"ml.machine_memory": "169009373184",
"ml.max_open_jobs": "20",
"xpack.installed": "true"
},
"node_decision": "yes",
"weight_ranking": 1
},
{
"node_id": "4Yuc1QkpQveiP_ccssUZGg",
"node_name": "prod-es04",
"transport_address": "192.168.xx.04:9300",
"node_attributes": {
"ml.machine_memory": "169009377280",
"ml.max_open_jobs": "20",
"xpack.installed": "true"
},
"node_decision": "yes",
"weight_ranking": 2
},
{
"node_id": "cx9cWDAlTWKVjnYrbq8J7w",
"node_name": "prod-es01",
"transport_address": "192.168.xx.01:9300",
"node_attributes": {
"ml.machine_memory": "169012310016",
"ml.max_open_jobs": "20",
"xpack.installed": "true"
},
"node_decision": "yes",
"weight_ranking": 3
},
{
"node_id": "kuTkuFzKSUSHkoInHW-SLw",
"node_name": "prod-es02",
"transport_address": "192.168.xx.02:9300",
"node_attributes": {
"ml.machine_memory": "169012310016",
"ml.max_open_jobs": "20",
"xpack.installed": "true"
},
"node_decision": "yes",
"weight_ranking": 4
},
{
"node_id": "-z-XdRisTIy2uaAv2unwrw",
"node_name": "prod-es03",
"transport_address": "192.168.xx.03:9300",
"node_attributes": {
"ml.machine_memory": "169012310016",
"xpack.installed": "true",
"ml.max_open_jobs": "20"
},
"node_decision": "no",
"weight_ranking": 5,
"deciders": [
{
"decider": "same_shard",
"decision": "NO",
"explanation": "the shard cannot be allocated to the same node on which a copy of the shard already exists [[reindexed-v6-mexplorer_entity3][2], node[-z-XdRisTIy2uaAv2unwrw], [P], s[STARTED], a[id=rf4jV-FxQ1qrmrHTETqM2Q]]"
}
]
}
]
}
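The explain output above is for whichever unassigned shard the API picked. To ask about a specific shard explicitly, the request can name it in the body (sketch):

```
GET _cluster/allocation/explain
{
  "index": "reindexed-v6-mexplorer_entity3",
  "shard": 2,
  "primary": false
}
```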

=================================

Cluster logs:
[2020-03-17T15:42:18,800][DEBUG][o.e.a.a.i.m.p.TransportPutMappingAction] [prod-es03] failed to put mappings on indices [[[building_1301_1400_/kcmbE4g1S4W_iaA5OCgm9A]]], type [_doc]
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (put-mapping) within 30s
        at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$0(MasterService.java:143) [elasticsearch-7.6.1.jar:7.6.1]
        at java.util.ArrayList.forEach(ArrayList.java:1507) [?:?]
        at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$1(MasterService.java:142) [elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) [elasticsearch-7.6.1.jar:7.6.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-03-17T15:45:31,787][DEBUG][o.e.a.a.c.a.TransportClusterAllocationExplainAction] [prod-es03] explaining the allocation for [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false], found shard [[reindexed-v6-mexplorer_entity3][2], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=REPLICA_ADDED], at[2020-03-17T10:34:30.540Z], delayed=false, allocation_status[no_attempt]]]
[2020-03-17T15:49:10,836][INFO ][o.e.m.j.JvmGcMonitorService] [prod-es03] [gc][20765] overhead, spent [250ms] collecting in the last [1s]
[2020-03-17T15:52:09,159][INFO ][o.e.m.j.JvmGcMonitorService] [prod-es03] [gc][20943] overhead, spent [301ms] collecting in the last [1s]

===================================

Cluster master node elasticsearch.yml

node.name: prod-es03

discovery.zen.minimum_master_nodes: 2

gateway.recover_after_nodes: 3

gateway.expected_nodes: 5
gateway.recover_after_time: 5m

discovery.zen.ping_timeout: 5s

discovery.zen.ping.unicast.hosts: ["192.168.xx.01","192.168.xx.02","192.168.xx.03","192.168.xx.04","192.168.xx.05"]

cluster.initial_master_nodes: ["192.168.xx.01","192.168.xx.02","192.168.xx.03"]
node.master: true

cluster.name: xxxcluster

===================================

4 data disks, each with 250MB/s throughput and 7,500 IOPS:
/dev/sdc1       2.0T  894G  1.1T  47% /data3
/dev/sdf1       2.0T  982G  942G  52% /data1
/dev/sdd1       2.0T  924G 1000G  49% /data
/dev/sde1       2.0T  892G  1.1T  47% /data2
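With the disks only around 50% full, the disk-based allocation watermarks (85%/90% by default) should not be the blocker here; per-node disk and shard-count figures can be confirmed with:

```
GET _cat/allocation?v
```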

At least one task on the master has been waiting for over 3.5 hours. That seems bad. Can you share GET _cluster/pending_tasks and GET _nodes/hot_threads?threads=999999 please?

Sure! Thanks for your quick response

GET hot_threads and pending tasks

There are 4 URGENT tasks right now, one of which is running:

{"insert_order":54309,"priority":"URGENT","source":"create-index [metricbeat-2020.03.17], cause [auto(bulk api)]","executing":true,"time_in_queue_millis":2987,"time_in_queue":"2.9s"}
{"insert_order":54310,"priority":"URGENT","source":"create-index [metricbeat-2020.03.17], cause [auto(bulk api)]","executing":false,"time_in_queue_millis":2067,"time_in_queue":"2s"}
{"insert_order":54311,"priority":"URGENT","source":"create-index [metricbeat-2020.03.17], cause [auto(bulk api)]","executing":false,"time_in_queue_millis":1231,"time_in_queue":"1.2s"}
{"insert_order":54312,"priority":"URGENT","source":"create-index [metricbeat-2020.03.17], cause [auto(bulk api)]","executing":false,"time_in_queue_millis":195,"time_in_queue":"195ms"}

They've not been running for very long, so it doesn't look like they're stuck. I would guess that they are happening often enough to cause starvation for lower-priority tasks, but I don't see why create-index [metricbeat-2020.03.17] would be happening repeatedly. Once the index has been created the nodes should move on.

If you GET _cluster/pending_tasks again do you see more instances of this create-index task occurring (with higher and higher insert_orders)? If not, what else do you see? Are you by chance also deleting this index repeatedly as it's being created?
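The check suggested here can be scripted: save the pending-tasks response to a file and count how many times the create-index task appears; a rising count across snapshots confirms the index is being created repeatedly. (The filename `pending_tasks.json` is hypothetical.)

```shell
# Count create-index tasks for the metricbeat index in a saved
# pending-tasks response (pending_tasks.json is a hypothetical filename).
grep -o 'create-index \[metricbeat-2020.03.17\]' pending_tasks.json | wc -l
```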

There is no metricbeat index in the cluster. I have stopped the metricbeat service and now all queued jobs have been processed within seconds. I appreciate your help in identifying this.

The job below was the culprit:

  "tasks" : [
    {
      "insert_order" : 58385,
      "priority" : "URGENT",
      "source" : "create-index [metricbeat-2020.03.17], cause [auto(bulk api)]",
      "executing" : true,
      "time_in_queue_millis" : 2336,
      "time_in_queue" : "2.3s"
    },

The job below had been waiting for 5.4 hours because metricbeat kept creating indexes with URGENT priority:

    {
      "insert_order" : 27571,
      "priority" : "HIGH",
      "source" : "shard-failed",
      "executing" : false,
      "time_in_queue_millis" : 19630338,
      "time_in_queue" : "5.4h"
    },

Now we are waiting for the cluster to replicate. Thank you!

Was there anything in the master's logs about failing to create an index? I think that would result in an exception being logged. If so, can you share it?

I can see the below logs:

[2020-03-17T10:07:28,149][DEBUG][o.e.a.a.i.t.p.TransportPutIndexTemplateAction] [prod-es03] failed to put template [metricbeat]

org.elasticsearch.index.mapper.MapperParsingException: Failed to parse mapping [_doc]: Root mapping definition has unsupported parameters:  [_default_ : {_meta={version=5.6.0}, dynamic_templates=[{strings_as_keyword={mapping={ignore_above=1024, type=keyword},

yum list installed | grep metricbeat
metricbeat.x86_64                   5.6.0-1                         installed

The cluster version is 7.6.1
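That version gap is likely the root cause of the template failure: mapping types, including `_default_`, were removed in Elasticsearch 7.x, so a 5.6-era Metricbeat template can no longer be parsed. The stale template can be inspected with (sketch):

```
GET _template/metricbeat
```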

Nope, that's related to updating a mapping on a template, nothing to do with creating an index.

I checked the logs from the last 2 weeks after unzipping them but couldn't find any "failed to create index" entries for metricbeat.

grep -ri "failed to create index" . > error.log
grep -ri "metricbeat" . > metricbeat.log

Ok, thanks for checking.