Elasticsearch is not reallocating shards after the primary shards were recovered. Unable to perform bulk indexing

We have a 5-node cluster that is not reallocating shards after the primary shards were recovered. To address this we increased the heap size from 80GB (on 160GB-RAM servers) to 192GB (on 384GB-RAM servers), partly because around 1000 jobs in the queue were slowing down the recovery. We initially had a 3-node cluster but added 2 more data nodes, as we have two indexes (650GB) with 14 shards; all shard sizes are below 50GB. Now all but a few replicas remain unallocated and bulk indexing is rejected by the cluster. This is affecting the application in production, and so far we haven't found a solution.

The issue started after we upgraded ES from 7.5.1 to 7.6.1. During this upgrade we had changed thread_pool.search.max_queue_size from 1000 to 2500. The cluster was green after the upgrade but it was not allowing bulk indexing. That's when we restarted the cluster, and it never turned green again. We changed the replica count to 0 and then back to 2, but it's the same: allocation stops after some time.
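For reference, the replica toggle described above can be issued through the index settings API; a sketch using the index name from the allocation output further down:

```
PUT /reindexed-v6-mexplorer_entity3/_settings
{
  "index": { "number_of_replicas": 0 }
}

PUT /reindexed-v6-mexplorer_entity3/_settings
{
  "index": { "number_of_replicas": 2 }
}
```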

thread_pool.search.max_queue_size has since been changed back to 1000.
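For anyone retracing this: thread_pool.search.max_queue_size is a static node-level setting, so reverting it means editing elasticsearch.yml on each node and restarting; a sketch:

```
# elasticsearch.yml (static setting; takes effect only after a node restart)
thread_pool.search.max_queue_size: 1000
```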

The cluster setting is as follows:

_cluster/health
{
  "cluster_name": "xxxcluster",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 5,
  "number_of_data_nodes": 5,
  "active_primary_shards": 362,
  "active_shards": 517,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 552,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 96,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 12906848,
  "active_shards_percent_as_number": 48.36295603367633
}

=========================================

_cluster/settings
{
  "persistent": {
    "cluster": {
      "routing": {
        "rebalance": {
          "enable": "all"
        },
        "allocation": {
          "cluster_concurrent_rebalance": "5",
          "node_concurrent_recoveries": "5",
          "enable": "all"
        }
      }
    },
    "indices": {
      "recovery": {
        "max_bytes_per_sec": "200mb"
      }
    },
    "xpack": {
      "monitoring": {
        "collection": {
          "enabled": "true"
        }
      }
    }
  },
  "transient": { }
}
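The persistent settings above would have been applied with something like the following (a sketch using the flattened key form):

```
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": "5",
    "cluster.routing.allocation.node_concurrent_recoveries": "5",
    "indices.recovery.max_bytes_per_sec": "200mb"
  }
}
```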

=====================================

_cluster/allocation/explain?pretty
{
"index": "reindexed-v6-mexplorer_entity3",
"shard": 2,
"primary": false,
"current_state": "unassigned",
"unassigned_info": {
"reason": "REPLICA_ADDED",
"at": "2020-03-17T10:34:30.540Z",
"last_allocation_status": "no_attempt"
},
"can_allocate": "yes",
"allocate_explanation": "can allocate the shard",
"target_node": {
"id": "cx9cWDAlTWKVjnYrbq8J7w",
"name": "prod-es01",
"transport_address": "192.168.xx.01:9300",
"attributes": {
"ml.machine_memory": "169012310016",
"ml.max_open_jobs": "20",
"xpack.installed": "true"
}
},
"node_allocation_decisions": [
{
"node_id": "ZpI06DeQRluhoKp-jr0FoQ",
"node_name": "prod-es05",
"transport_address": "192.168.xx.05:9300",
"node_attributes": {
"ml.machine_memory": "169009373184",
"ml.max_open_jobs": "20",
"xpack.installed": "true"
},
"node_decision": "yes",
"weight_ranking": 1
},
{
"node_id": "4Yuc1QkpQveiP_ccssUZGg",
"node_name": "prod-es04",
"transport_address": "192.168.xx.04:9300",
"node_attributes": {
"ml.machine_memory": "169009377280",
"ml.max_open_jobs": "20",
"xpack.installed": "true"
},
"node_decision": "yes",
"weight_ranking": 2
},
{
"node_id": "cx9cWDAlTWKVjnYrbq8J7w",
"node_name": "prod-es01",
"transport_address": "192.168.xx.01:9300",
"node_attributes": {
"ml.machine_memory": "169012310016",
"ml.max_open_jobs": "20",
"xpack.installed": "true"
},
"node_decision": "yes",
"weight_ranking": 3
},
{
"node_id": "kuTkuFzKSUSHkoInHW-SLw",
"node_name": "prod-es02",
"transport_address": "192.168.xx.02:9300",
"node_attributes": {
"ml.machine_memory": "169012310016",
"ml.max_open_jobs": "20",
"xpack.installed": "true"
},
"node_decision": "yes",
"weight_ranking": 4
},
{
"node_id": "-z-XdRisTIy2uaAv2unwrw",
"node_name": "prod-es03",
"transport_address": "192.168.xx.03:9300",
"node_attributes": {
"ml.machine_memory": "169012310016",
"xpack.installed": "true",
"ml.max_open_jobs": "20"
},
"node_decision": "no",
"weight_ranking": 5,
"deciders": [
{
"decider": "same_shard",
"decision": "NO",
"explanation": "the shard cannot be allocated to the same node on which a copy of the shard already exists [[reindexed-v6-mexplorer_entity3][2], node[-z-XdRisTIy2uaAv2unwrw], [P], s[STARTED], a[id=rf4jV-FxQ1qrmrHTETqM2Q]]"
}
]
}
]
}
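The explain output above is for whichever unassigned shard the API picked. To ask about a specific shard explicitly, the request can name it in the body (sketch):

```
GET _cluster/allocation/explain
{
  "index": "reindexed-v6-mexplorer_entity3",
  "shard": 2,
  "primary": false
}
```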

=================================

Cluster logs:
[2020-03-17T15:42:18,800][DEBUG][o.e.a.a.i.m.p.TransportPutMappingAction] [prod-es03] failed to put mappings on indices [[[building_1301_1400_/kcmbE4g1S4W_iaA5OCgm9A]]], type [_doc]
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (put-mapping) within 30s
        at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$0(MasterService.java:143) [elasticsearch-7.6.1.jar:7.6.1]
        at java.util.ArrayList.forEach(ArrayList.java:1507) [?:?]
        at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$1(MasterService.java:142) [elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) [elasticsearch-7.6.1.jar:7.6.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-03-17T15:45:31,787][DEBUG][o.e.a.a.c.a.TransportClusterAllocationExplainAction] [prod-es03] explaining the allocation for [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false], found shard [[reindexed-v6-mexplorer_entity3][2], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=REPLICA_ADDED], at[2020-03-17T10:34:30.540Z], delayed=false, allocation_status[no_attempt]]]
[2020-03-17T15:49:10,836][INFO ][o.e.m.j.JvmGcMonitorService] [prod-es03] [gc][20765] overhead, spent [250ms] collecting in the last [1s]
[2020-03-17T15:52:09,159][INFO ][o.e.m.j.JvmGcMonitorService] [prod-es03] [gc][20943] overhead, spent [301ms] collecting in the last [1s]

===================================

Cluster master node elasticsearch.yml

node.name: prod-es03

discovery.zen.minimum_master_nodes: 2

gateway.recover_after_nodes: 3

gateway.expected_nodes: 5
gateway.recover_after_time: 5m

discovery.zen.ping_timeout: 5s

discovery.zen.ping.unicast.hosts: ["192.168.xx.01","192.168.xx.02","192.168.xx.03","192.168.xx.04","192.168.xx.05"]

cluster.initial_master_nodes: ["192.168.xx.01","192.168.xx.02","192.168.xx.03"]
node.master: true

cluster.name: xxxcluster

===================================

4 data disks, each with 250MB/s throughput and 7,500 IOPS:
/dev/sdc1       2.0T  894G  1.1T  47% /data3
/dev/sdf1       2.0T  982G  942G  52% /data1
/dev/sdd1       2.0T  924G 1000G  49% /data
/dev/sde1       2.0T  892G  1.1T  47% /data2
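With the disks only around 50% full, the disk-based allocation watermarks (85%/90% by default) should not be the blocker here; per-node disk and shard-count figures can be confirmed with:

```
GET _cat/allocation?v
```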

At least one task on the master has been waiting for over 3.5 hours. That seems bad. Can you share GET _cluster/pending_tasks and GET _nodes/hot_threads?threads=999999 please?

Sure! Thanks for your quick response

GET hot_threads and pending tasks

There are 4 URGENT tasks right now, one of which is running:

{"insert_order":54309,"priority":"URGENT","source":"create-index [metricbeat-2020.03.17], cause [auto(bulk api)]","executing":true,"time_in_queue_millis":2987,"time_in_queue":"2.9s"}
{"insert_order":54310,"priority":"URGENT","source":"create-index [metricbeat-2020.03.17], cause [auto(bulk api)]","executing":false,"time_in_queue_millis":2067,"time_in_queue":"2s"}
{"insert_order":54311,"priority":"URGENT","source":"create-index [metricbeat-2020.03.17], cause [auto(bulk api)]","executing":false,"time_in_queue_millis":1231,"time_in_queue":"1.2s"}
{"insert_order":54312,"priority":"URGENT","source":"create-index [metricbeat-2020.03.17], cause [auto(bulk api)]","executing":false,"time_in_queue_millis":195,"time_in_queue":"195ms"}

They've not been running for very long, so it doesn't look like they're stuck. I would guess that they are happening often enough to cause starvation for lower-priority tasks, but I don't see why create-index [metricbeat-2020.03.17] would be happening repeatedly. Once the index has been created the nodes should move on.

If you GET _cluster/pending_tasks again do you see more instances of this create-index task occurring (with higher and higher insert_orders)? If not, what else do you see? Are you by chance also deleting this index repeatedly as it's being created?
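The check suggested here can be scripted: save the pending-tasks response to a file and count how many times the create-index task appears; a rising count across snapshots confirms the index is being created repeatedly. (The filename `pending_tasks.json` is hypothetical.)

```shell
# Count create-index tasks for the metricbeat index in a saved
# pending-tasks response (pending_tasks.json is a hypothetical filename).
grep -o 'create-index \[metricbeat-2020.03.17\]' pending_tasks.json | wc -l
```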

There is no metricbeat index in the cluster. I have stopped the metricbeat service and now all queued jobs have been processed within seconds. I appreciate your help in identifying this.

The job below was the culprit:

  "tasks" : [
    {
      "insert_order" : 58385,
      "priority" : "URGENT",
      "source" : "create-index [metricbeat-2020.03.17], cause [auto(bulk api)]",
      "executing" : true,
      "time_in_queue_millis" : 2336,
      "time_in_queue" : "2.3s"
    },

The job below had been waiting for 5.4 hours because metricbeat kept creating indexes with URGENT priority:

    {
      "insert_order" : 27571,
      "priority" : "HIGH",
      "source" : "shard-failed",
      "executing" : false,
      "time_in_queue_millis" : 19630338,
      "time_in_queue" : "5.4h"
    },

Now we are waiting for the cluster to replicate. Thank you!

Was there anything in the master's logs about failing to create an index? I think that would result in an exception being logged. If so, can you share it?

I can see the below logs:

[2020-03-17T10:07:28,149][DEBUG][o.e.a.a.i.t.p.TransportPutIndexTemplateAction] [prod-es03] failed to put template [metricbeat]

org.elasticsearch.index.mapper.MapperParsingException: Failed to parse mapping [_doc]: Root mapping definition has unsupported parameters:  [_default_ : {_meta={version=5.6.0}, dynamic_templates=[{strings_as_keyword={mapping={ignore_above=1024, type=keyword},

yum list installed | grep metricbeat
metricbeat.x86_64                   5.6.0-1                         installed

The cluster version is 7.6.1
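That version gap is likely the root cause of the template failure: mapping types, including `_default_`, were removed in Elasticsearch 7.x, so a 5.6-era Metricbeat template can no longer be parsed. The stale template can be inspected with (sketch):

```
GET _template/metricbeat
```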

Nope, that's related to updating a mapping on a template, nothing to do with creating an index.

I checked the logs from the last 2 weeks after unzipping them but couldn't find any "failed to create index" entries for metricbeat.

grep -ri "failed to create index" . > error.log
grep -ri "metricbeat" . > metricbeat.log

Ok, thanks for checking.