No shards are getting assigned on one of the data nodes

I have a cluster running with 10 data, 3 master, and 3 client nodes. I am facing an issue on one of the data nodes where no shard is able to get assigned to it.
_cluster/allocation/explain gives the message below:

{
      "node_id" : "TULFVEcrTxCfINXVazprLA",
      "node_name" : "data-1",
      "transport_address" : "192.168.16.126:9300",
      "node_decision" : "throttled",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "throttling",
          "decision" : "THROTTLE",
          "explanation" : "reached the limit of ongoing initial primary recoveries [4], cluster setting [cluster.routing.allocation.node_initial_primaries_recoveries=4]"
        }
      ]
    }
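
For reference, the THROTTLE decision only means that recoveries to this node are queued behind the per-node limit, so it is worth checking what is actually recovering and, if needed, raising the limit temporarily. A rough sketch (assuming the REST API is reachable at https://localhost:9200 and that admin:changeme stands in for your Search Guard credentials; adjust the scheme, TLS options and user to your setup):

# Recoveries currently in flight (active_only hides completed ones)
curl -s --user admin:changeme "https://localhost:9200/_cat/recovery?v&active_only=true"

# Temporarily raise the per-node limit on initial primary recoveries (default is 4);
# transient settings do not survive a full cluster restart
curl -s --user admin:changeme -H 'Content-Type: application/json' -X PUT "https://localhost:9200/_cluster/settings" -d '{"transient":{"cluster.routing.allocation.node_initial_primaries_recoveries":8}}'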

I see the log message below frequently in the data-1 logs:
{"type":"log","host":"Elasticsearch-data-1","level":"WARN","time": "2021-09-18T15:03:32.407Z","logger":"o.e.d.z.ZenDiscovery","timezone":"UTC","marker":"[Elasticsearch-data-1] ","log":"dropping pending state [[uuid[p2MPdq59QCKtduezFv9Y2A], v[175229293], m[NAlqPclnQY-25G9p6_4mBA]]]. more than [25] pending states."}

Also, even though shards are getting assigned on the other data nodes, the document count for them is still zero, and I did not find any relevant message in the logs that could explain this behaviour. There are continuous garbage collector warnings in the logs:

{"type":"lob94d-ckhfw","level":"WARN","time": "2021-12-17T03:51:10.869Z","logger":"o.e.m.j.JvmGcMonitorService","timezone":"UTC","marker":"[elasticsearch-master-55ff74b94d-ckhfw] ","log":"[gc][36006457] overhead, spent [754ms] collecting in the last [1.2s]"}
{"type":"log","host":"elasticsearch-master-55ff74b94d-ckhfw","level":"WARN","time": "2021-12-17T04:09:04.407Z","logger":"o.e.m.j.JvmGcMonitorService","timezone":"UTC","marker":"[elasticsearch-master-55ff74b94d-ckhfw] ","log":"[gc][young][36007530][693943] duration [1s], collections [1]/[1.2s], total [1s]/[13.4h], memory [8gb]->[7.8gb]/[15.9gb], all_pools {[young] [264.4mb]->[4.1mb]/[266.2mb]}{[survivor] [5.1mb]->[3.7mb]/[33.2mb]}{[old] [7.8gb]->[7.8gb]/[15.6gb]}"}

Note: the JVM heap is set to 32GB, since I have set it to the maximum allowed value. Can this cause such an issue?
Please look into this and suggest what the reason could be and how to overcome it.
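
As I understand it, at 32GB the JVM may stop using compressed ordinary object pointers, which can make such a heap less efficient than one just below ~31GB. Whether a node is still using compressed oops can be checked with the nodes info API (rough sketch, same placeholders as above):

# "true" means the node is still below the compressed-oops threshold
curl -s --user admin:changeme "https://localhost:9200/_nodes/data-1/jvm?pretty&filter_path=**.using_compressed_ordinary_object_pointers"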

Are all nodes running exactly the same version?

Yes, all nodes are running exactly the same version.

Can you please show the full output of the cluster stats API?

_cluster/stats API output:

{
   "_nodes" : {
     "total" : 16,
     "successful" : 16,
     "failed" : 0
   },
   "cluster_name" : "elastic-dr-elastic",
   "cluster_uuid" : "btgycnVpQdKbfefT8TOBg",
   "timestamp" : 1639718609354,
   "status" : "yellow",
   "indices" : {
     "count" : 229,
     "shards" : {
       "total" : 2709,
       "primaries" : 1351,
       "replication" : 1.005181347150259,
       "index" : {
         "shards" : {
           "min" : 2,
           "max" : 12,
           "avg" : 11.829694323144105
         },
         "primaries" : {
           "min" : 1,
           "max" : 6,
           "avg" : 5.899563318777292
         },
         "replication" : {
           "min" : 1.0,
           "max" : 8.0,
           "avg" : 1.0305676855895196
         }
       }
     },
     "docs" : {
       "count" : 9,
       "deleted" : 0
     },
     "store" : {
       "size_in_bytes" : 998468
     },
     "fielddata" : {
       "memory_size_in_bytes" : 0,
       "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 0,
      "total_count" : 0,
      "hit_count" : 0,
      "miss_count" : 0,
      "cache_size" : 0,
      "cache_count" : 0,
      "evictions" : 0
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 37,
      "memory_in_bytes" : 65075,
      "terms_memory_in_bytes" : 41354,
      "stored_fields_memory_in_bytes" : 15688,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 4288,
      "points_memory_in_bytes" : 45,
      "doc_values_memory_in_bytes" : 3700,
      "index_writer_memory_in_bytes" : 0,
      "version_map_memory_in_bytes" : 0,
      "fixed_bit_set_memory_in_bytes" : 0,
      "max_unsafe_auto_id_timestamp" : -1,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 16,
      "data" : 10,
      "coordinating_only" : 0,
      "master" : 3,
      "ingest" : 16
    },
    "versions" : [
      "6.6.1"
    ],
     "os" : {
       "available_processors" : 144,
       "allocated_processors" : 144,
       "names" : [
         {
           "name" : "Linux",
           "count" : 16
         }
       ],
       "pretty_names" : [
         {
           "pretty_name" : "CentOS Linux 7 (Core)",
           "count" : 16
         }
       ],
       "mem" : {
         "total_in_bytes" : 1619268435968,
         "free_in_bytes" : 714996961280,
         "used_in_bytes" : 904271474688,
         "free_percent" : 44,
         "used_percent" : 56
       }
     },
     "process" : {
       "cpu" : {
         "percent" : 0
       },
       "open_file_descriptors" : {
         "min" : 602,
         "max" : 1314,
         "avg" : 1009
       }
     },
     "jvm" : {
       "max_uptime_in_millis" : 36969780618,
       "versions" : [
         {
           "version" : "1.8.0_212",
           "vm_name" : "OpenJDK 64-Bit Server VM",
           "vm_version" : "25.212-b04",
           "vm_vendor" : "Oracle Corporation",
           "count" : 16
         }
       ],
       "mem" : {
         "heap_used_in_bytes" : 61973215288,
         "heap_max_in_bytes" : 413382868992
       },
       "threads" : 1535
     },
     "fs" : {
       "total_in_bytes" : 119748558626816,
       "free_in_bytes" : 119656676990976,
       "available_in_bytes" : 119615178137600
     },
     "plugins" : [
       {
         "name" : "ingest-user-agent",
         "version" : "6.6.1",
         "elasticsearch_version" : "6.6.1",
         "java_version" : "1.8",
         "description" : "Ingest processor that extracts information from a user agent",
         "classname" : "org.elasticsearch.ingest.useragent.IngestUserAgentPlugin",
         "extended_plugins" : [ ],
         "has_native_controller" : false
       },
       {
         "name" : "search-guard-6",
         "version" : "6.6.1-24.3",
         "elasticsearch_version" : "6.6.1",
         "java_version" : "1.8",
         "description" : "Provide access control related features for Elasticsearch 6",
         "classname" : "com.floragunn.searchguard.SearchGuardPlugin",
         "extended_plugins" : [ ],
         "has_native_controller" : false
       },
       {
         "name" : "ingest-geoip",
         "version" : "6.6.1",
         "elasticsearch_version" : "6.6.1",
         "java_version" : "1.8",
         "description" : "Ingest processor that uses looksup geo data based on ip adresses using the Maxmind geo database",
         "classname" : "org.elasticsearch.ingest.geoip.IngestGeoIpPlugin",
         "extended_plugins" : [ ],
         "has_native_controller" : false
       },
       {
         "name" : "prometheus-exporter",
         "version" : "6.6.1.0",
         "elasticsearch_version" : "6.6.1",
         "java_version" : "1.8",
         "description" : "Export Elasticsearch metrics to Prometheus",
         "classname" : "org.elasticsearch.plugin.prometheus.PrometheusExporterPlugin",
         "extended_plugins" : [ ],
         "has_native_controller" : false
       },
       {
         "name" : "ingest-attachment",
         "version" : "6.6.1",
         "elasticsearch_version" : "6.6.1",
         "java_version" : "1.8",
         "description" : "Ingest processor that uses Apache Tika to extract contents",
         "classname" : "org.elasticsearch.ingest.attachment.IngestAttachmentPlugin",
         "extended_plugins" : [ ],
         "has_native_controller" : false
       }
     ],
     "network_types" : {
       "transport_types" : {
         "com.floragunn.searchguard.ssl.http.netty.SearchGuardSSLNettyTransport" : 16
       },
       "http_types" : {
         "com.floragunn.searchguard.http.SearchGuardHttpServerTransport" : 16
       }
     }
   }
 }

This is pretty much the same thing as your other thread in which your cluster seems heavily overloaded and is running a very old version. You have less than 1MB of data but well over 1000 primary shards.

As mentioned before, you should upgrade as a matter of some urgency. Newer versions will deal with high shard counts better, and also have better diagnostic tooling for working out what else might be going wrong.

In the meantime I expect it will help to delete any indices you don't need.
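
For example (a rough sketch; "unused-index-name" is a placeholder, and adjust the URL and credentials for your Search Guard setup): listing indices sorted by document count makes the empty ones easy to spot, and each can then be removed with a DELETE.

# Indices with shard and document counts, smallest first
curl -s --user admin:changeme "https://localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size&s=docs.count"

# Delete an index that is no longer needed (placeholder name)
curl -s --user admin:changeme -X DELETE "https://localhost:9200/unused-index-name"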

Hi @DavidTurner,
Can you please explain in what ways a newer version would handle this issue better, and how we can diagnose the "dropping pending state" issue?

There have been many improvements in the ~3 years since 6.6 was released. The whole concept of "pending state" no longer exists, for instance.
