I have a cluster running with 10 data nodes, 3 master nodes, and 3 client nodes. I am facing an issue on one of the data nodes where no shards are able to get assigned to it.
_cluster/allocation/explain gives the following output for that node:
{
  "node_id" : "TULFVEcrTxCfINXVazprLA",
  "node_name" : "data-1",
  "transport_address" : "192.168.16.126:9300",
  "node_decision" : "throttled",
  "weight_ranking" : 1,
  "deciders" : [
    {
      "decider" : "throttling",
      "decision" : "THROTTLE",
      "explanation" : "reached the limit of ongoing initial primary recoveries [4], cluster setting [cluster.routing.allocation.node_initial_primaries_recoveries=4]"
    }
  ]
}
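For reference, this is roughly how I pulled the explain output, and the setting the decider refers to. I was considering temporarily raising that recovery throttle (I am not sure whether that is the right fix here, so treat this as a sketch; the host/port are placeholders):

# allocation explain for the problematic node
curl -s -XGET 'http://localhost:9200/_cluster/allocation/explain?pretty'

# possible temporary bump of the initial primary recovery throttle (default is 4)
curl -s -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "cluster.routing.allocation.node_initial_primaries_recoveries": 8
  }
}'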
I see the following log message frequently in the data-1 logs:
{"type":"log","host":"Elasticsearch-data-1","level":"WARN","time": "2021-09-18T15:03:32.407Z","logger":"o.e.d.z.ZenDiscovery","timezone":"UTC","marker":"[Elasticsearch-data-1] ","log":"dropping pending state [[uuid[p2MPdq59QCKtduezFv9Y2A], v[175229293], m[NAlqPclnQY-25G9p6_4mBA]]]. more than [25] pending states."}
Also, even though shards are getting assigned to the other data nodes, the document count for them is still zero, and I did not find any relevant message in the logs that could explain this behaviour. There are also continuous garbage collector warnings in the logs:
{"type":"lob94d-ckhfw","level":"WARN","time": "2021-12-17T03:51:10.869Z","logger":"o.e.m.j.JvmGcMonitorService","timezone":"UTC","marker":"[elasticsearch-master-55ff74b94d-ckhfw] ","log":"[gc][36006457] overhead, spent [754ms] collecting in the last [1.2s]"}
{"type":"log","host":"elasticsearch-master-55ff74b94d-ckhfw","level":"WARN","time": "2021-12-17T04:09:04.407Z","logger":"o.e.m.j.JvmGcMonitorService","timezone":"UTC","marker":"[elasticsearch-master-55ff74b94d-ckhfw] ","log":"[gc][young][36007530][693943] duration [1s], collections [1]/[1.2s], total [1s]/[13.4h], memory [8gb]->[7.8gb]/[15.9gb], all_pools {[young] [264.4mb]->[4.1mb]/[266.2mb]}{[survivor] [5.1mb]->[3.7mb]/[33.2mb]}{[old] [7.8gb]->[7.8gb]/[15.6gb]}"}
Note: the JVM heap is set to 32 GB, since I have set it to the maximum allowed value. Can this cause such an issue?
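For what it is worth, this is how I have been checking the heap each node actually ended up with, to confirm whether the 32 GB setting really took effect everywhere (host is a placeholder):

curl -s -XGET 'http://localhost:9200/_cat/nodes?v&h=name,node.role,heap.max,heap.percent,ram.max'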
Please look into this and suggest what could be causing it and how to overcome it.