We have a 5-node cluster that is not allocating replica shards after the primary shards were recovered. To address this, we increased the heap size on the servers from 80GB (160GB RAM) to 192GB (384GB RAM), partly because there were around 1,000 tasks in the queue slowing down recovery. We initially had a 3-node cluster but added 2 more data nodes, since we have two large indices (650GB) with 14 shards; all shards are below 50GB. Now only a few replicas have been allocated, and bulk indexing is being rejected by the cluster. This is affecting the application in production, and so far we haven't found a solution.
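For reference, the heap is set per node in jvm.options; with the values above it looks like this (a sketch of our current configuration, not the whole file):

-Xms192g
-Xmx192g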
The issue started after we upgraded Elasticsearch from 7.5.1 to 7.6.1. During the upgrade we changed thread_pool.search.max_queue_size from 1000 to 2500. The cluster was green after the upgrade but was rejecting bulk indexing, so we restarted it, and it has never turned green since. We set the replica count to 0 and then back to 2, but the result is the same: allocation proceeds for a while and then stops.
thread_pool.search.max_queue_size has since been changed back to 1000.
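For reference, we toggled the replica count with update-settings calls along these lines (shown here against one of the affected indices):

PUT reindexed-v6-mexplorer_entity3/_settings
{
  "index": { "number_of_replicas": 0 }
}

PUT reindexed-v6-mexplorer_entity3/_settings
{
  "index": { "number_of_replicas": 2 }
}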
The current cluster state and settings are as follows:
_cluster/health
{
  "cluster_name": "xxxcluster",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 5,
  "number_of_data_nodes": 5,
  "active_primary_shards": 362,
  "active_shards": 517,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 552,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 96,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 12906848,
  "active_shards_percent_as_number": 48.36295603367633
}
=========================================
_cluster/settings
{
  "persistent": {
    "cluster": {
      "routing": {
        "rebalance": {
          "enable": "all"
        },
        "allocation": {
          "cluster_concurrent_rebalance": "5",
          "node_concurrent_recoveries": "5",
          "enable": "all"
        }
      }
    },
    "indices": {
      "recovery": {
        "max_bytes_per_sec": "200mb"
      }
    },
    "xpack": {
      "monitoring": {
        "collection": {
          "enabled": "true"
        }
      }
    }
  },
  "transient": {}
}
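These persistent settings were applied via the cluster settings API, with a call roughly like:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.node_concurrent_recoveries": 5,
    "cluster.routing.allocation.cluster_concurrent_rebalance": 5,
    "indices.recovery.max_bytes_per_sec": "200mb"
  }
}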
=====================================
_cluster/allocation/explain?pretty
{
  "index": "reindexed-v6-mexplorer_entity3",
  "shard": 2,
  "primary": false,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "REPLICA_ADDED",
    "at": "2020-03-17T10:34:30.540Z",
    "last_allocation_status": "no_attempt"
  },
  "can_allocate": "yes",
  "allocate_explanation": "can allocate the shard",
  "target_node": {
    "id": "cx9cWDAlTWKVjnYrbq8J7w",
    "name": "prod-es01",
    "transport_address": "192.168.xx.01:9300",
    "attributes": {
      "ml.machine_memory": "169012310016",
      "ml.max_open_jobs": "20",
      "xpack.installed": "true"
    }
  },
  "node_allocation_decisions": [
    {
      "node_id": "ZpI06DeQRluhoKp-jr0FoQ",
      "node_name": "prod-es05",
      "transport_address": "192.168.xx.05:9300",
      "node_attributes": {
        "ml.machine_memory": "169009373184",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true"
      },
      "node_decision": "yes",
      "weight_ranking": 1
    },
    {
      "node_id": "4Yuc1QkpQveiP_ccssUZGg",
      "node_name": "prod-es04",
      "transport_address": "192.168.xx.04:9300",
      "node_attributes": {
        "ml.machine_memory": "169009377280",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true"
      },
      "node_decision": "yes",
      "weight_ranking": 2
    },
    {
      "node_id": "cx9cWDAlTWKVjnYrbq8J7w",
      "node_name": "prod-es01",
      "transport_address": "192.168.xx.01:9300",
      "node_attributes": {
        "ml.machine_memory": "169012310016",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true"
      },
      "node_decision": "yes",
      "weight_ranking": 3
    },
    {
      "node_id": "kuTkuFzKSUSHkoInHW-SLw",
      "node_name": "prod-es02",
      "transport_address": "192.168.xx.02:9300",
      "node_attributes": {
        "ml.machine_memory": "169012310016",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true"
      },
      "node_decision": "yes",
      "weight_ranking": 4
    },
    {
      "node_id": "-z-XdRisTIy2uaAv2unwrw",
      "node_name": "prod-es03",
      "transport_address": "192.168.xx.03:9300",
      "node_attributes": {
        "ml.machine_memory": "169012310016",
        "xpack.installed": "true",
        "ml.max_open_jobs": "20"
      },
      "node_decision": "no",
      "weight_ranking": 5,
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "the shard cannot be allocated to the same node on which a copy of the shard already exists [[reindexed-v6-mexplorer_entity3][2], node[-z-XdRisTIy2uaAv2unwrw], [P], s[STARTED], a[id=rf4jV-FxQ1qrmrHTETqM2Q]]"
        }
      ]
    }
  ]
}
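The output above is from the parameterless explain call, which picks an arbitrary unassigned shard (see the DEBUG log below with useAnyUnassignedShard=true). To pin it to this specific replica, the request can take a body, e.g.:

GET _cluster/allocation/explain
{
  "index": "reindexed-v6-mexplorer_entity3",
  "shard": 2,
  "primary": false
}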
=================================
Cluster logs:
[2020-03-17T15:42:18,800][DEBUG][o.e.a.a.i.m.p.TransportPutMappingAction] [prod-es03] failed to put mappings on indices [[[building_1301_1400_/kcmbE4g1S4W_iaA5OCgm9A]]], type [_doc]
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (put-mapping) within 30s
at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$0(MasterService.java:143) [elasticsearch-7.6.1.jar:7.6.1]
at java.util.ArrayList.forEach(ArrayList.java:1507) [?:?]
at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$1(MasterService.java:142) [elasticsearch-7.6.1.jar:7.6.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) [elasticsearch-7.6.1.jar:7.6.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-03-17T15:45:31,787][DEBUG][o.e.a.a.c.a.TransportClusterAllocationExplainAction] [prod-es03] explaining the allocation for [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false], found shard [[reindexed-v6-mexplorer_entity3][2], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=REPLICA_ADDED], at[2020-03-17T10:34:30.540Z], delayed=false, allocation_status[no_attempt]]]
[2020-03-17T15:49:10,836][INFO ][o.e.m.j.JvmGcMonitorService] [prod-es03] [gc][20765] overhead, spent [250ms] collecting in the last [1s]
[2020-03-17T15:52:09,159][INFO ][o.e.m.j.JvmGcMonitorService] [prod-es03] [gc][20943] overhead, spent [301ms] collecting in the last [1s]
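Given the put-mapping timeout above and the 96 pending tasks reported by _cluster/health, we are also keeping an eye on the master's task backlog with:

GET _cluster/pending_tasks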
===================================
elasticsearch.yml on the current master node (prod-es03):
node.name: prod-es03
discovery.zen.minimum_master_nodes: 2
gateway.recover_after_nodes: 3
gateway.expected_nodes: 5
gateway.recover_after_time: 5m
discovery.zen.ping_timeout: 5s
discovery.zen.ping.unicast.hosts: ["192.168.xx.01","192.168.xx.02","192.168.xx.03","192.168.xx.04","192.168.xx.05"]
cluster.initial_master_nodes: ["192.168.xx.01","192.168.xx.02","192.168.xx.03"]
node.master: true
cluster.name: xxxcluster
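We know the discovery.zen.* settings are deprecated in 7.x; if relevant, the seed-hosts equivalent would be something like:

discovery.seed_hosts: ["192.168.xx.01","192.168.xx.02","192.168.xx.03","192.168.xx.04","192.168.xx.05"]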
===================================
4 data disks, each with 250 MB/s throughput and 7,500 IOPS:

Filesystem      Size  Used  Avail  Use%  Mounted on
/dev/sdc1       2.0T  894G  1.1T   47%   /data3
/dev/sdf1       2.0T  982G  942G   52%   /data1
/dev/sdd1       2.0T  924G  1000G  49%   /data
/dev/sde1       2.0T  892G  1.1T   47%   /data2