Receiving "timed out waiting for all nodes to process published state"

I am using Elasticsearch v6.6.1. There are 16 nodes in total (10 data, 3 master, 3 client).
I frequently see this message in the elected master's logs:

{"type":"log","host":"Elasticsearch-master-55ff74b94d-ckhfw","level":"WARN", "time": "2021-10-13T00:02:51.273Z", "logger":"o.e.d.z.PublishClusterStateAction","
timezone":"UTC","marker":"[Elasticsearch-master-55ff74b94d-ckhfw] ","log":"timed out waiting for all nodes to process published state [175229398] (timeout [30s], pending nodes: [{Elasticsearch-data-1}{TULFVEcrTxCfINXVazprLA}{Yd1lhNcUQxaKFW8zBhIPjQ}{IP:9300}])"}

And there are frequent messages like this on the data-1 node:
{"type":"log","host":"Elasticsearch-data-1","level":"WARN","time": "2021-09-18T15:03:32.407Z","logger":"o.e.d.z.ZenDiscovery","timezone":"UTC","marker":"[Elasticsearch-data-1] ","log":"dropping pending state [[uuid[p2MPdq59QCKtduezFv9Y2A], v[175229293], m[NAlqPclnQY-25G9p6_4mBA]]]. more than [25] pending states."}

The main issue is that no shards get assigned to the data-1 node once these logs start appearing. Only data-1 is affected; the other nodes are working fine.
Please suggest a way to overcome this.
Thanks.

v6.6.1 was released almost 3 years ago and is long past EOL. This whole area was rewritten in 7.0. You should upgrade.
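
For context on the data-1 log line: in 6.x Zen discovery, a node queues cluster-state updates it has not yet applied, and once the queue grows beyond its limit (25 by default) the node starts dropping pending states, which is why the master then times out waiting for that node. If I recall the 6.x settings correctly, the limit is the static node setting below; raising it only buys headroom and does not fix a node that cannot keep up with cluster state updates:

```yaml
# elasticsearch.yml on the affected node (static setting, needs a restart).
# Default is 25, matching the "more than [25] pending states" message.
discovery.zen.publish.max_pending_cluster_states: 50
```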

Okay, we will try to upgrade. In the meantime, any insight into these logs and the underlying issue would be a great help.

What is the full output of the cluster stats API?

The output of the cluster stats API is:

{
   "_nodes" : {
     "total" : 16,
     "successful" : 16,
     "failed" : 0
   },
   "cluster_name" : "elastic-dr-elastic",
   "cluster_uuid" : "btgycnVpQdKbfefT8TOBg",
   "timestamp" : 1639718609354,
   "status" : "yellow",
   "indices" : {
     "count" : 229,
     "shards" : {
       "total" : 2709,
       "primaries" : 1351,
       "replication" : 1.005181347150259,
       "index" : {
         "shards" : {
           "min" : 2,
           "max" : 12,
           "avg" : 11.829694323144105
         },
         "primaries" : {
           "min" : 1,
           "max" : 6,
           "avg" : 5.899563318777292
         },
         "replication" : {
           "min" : 1.0,
           "max" : 8.0,
           "avg" : 1.0305676855895196
         }
       }
     },
     "docs" : {
       "count" : 9,
       "deleted" : 0
     },
     "store" : {
       "size_in_bytes" : 998468
     },
     "fielddata" : {
       "memory_size_in_bytes" : 0,
       "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 0,
      "total_count" : 0,
      "hit_count" : 0,
      "miss_count" : 0,
      "cache_size" : 0,
      "cache_count" : 0,
      "evictions" : 0
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 37,
      "memory_in_bytes" : 65075,
      "terms_memory_in_bytes" : 41354,
      "stored_fields_memory_in_bytes" : 15688,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 4288,
      "points_memory_in_bytes" : 45,
      "doc_values_memory_in_bytes" : 3700,
      "index_writer_memory_in_bytes" : 0,
      "version_map_memory_in_bytes" : 0,
      "fixed_bit_set_memory_in_bytes" : 0,
      "max_unsafe_auto_id_timestamp" : -1,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 16,
      "data" : 10,
      "coordinating_only" : 0,
      "master" : 3,
      "ingest" : 16
    },
    "versions" : [
      "6.6.1"
    ],
     "os" : {
       "available_processors" : 144,
       "allocated_processors" : 144,
       "names" : [
         {
           "name" : "Linux",
           "count" : 16
         }
       ],
       "pretty_names" : [
         {
           "pretty_name" : "CentOS Linux 7 (Core)",
           "count" : 16
         }
       ],
       "mem" : {
         "total_in_bytes" : 1619268435968,
         "free_in_bytes" : 714996961280,
         "used_in_bytes" : 904271474688,
         "free_percent" : 44,
         "used_percent" : 56
       }
     },
     "process" : {
       "cpu" : {
         "percent" : 0
       },
       "open_file_descriptors" : {
         "min" : 602,
         "max" : 1314,
         "avg" : 1009
       }
     },
     "jvm" : {
       "max_uptime_in_millis" : 36969780618,
       "versions" : [
         {
           "version" : "1.8.0_212",
           "vm_name" : "OpenJDK 64-Bit Server VM",
           "vm_version" : "25.212-b04",
           "vm_vendor" : "Oracle Corporation",
           "count" : 16
         }
       ],
       "mem" : {
         "heap_used_in_bytes" : 61973215288,
         "heap_max_in_bytes" : 413382868992
       },
       "threads" : 1535
     },
     "fs" : {
       "total_in_bytes" : 119748558626816,
       "free_in_bytes" : 119656676990976,
       "available_in_bytes" : 119615178137600
     },
     "plugins" : [
       {
         "name" : "ingest-user-agent",
         "version" : "6.6.1",
         "elasticsearch_version" : "6.6.1",
         "java_version" : "1.8",
         "description" : "Ingest processor that extracts information from a user agent",
         "classname" : "org.elasticsearch.ingest.useragent.IngestUserAgentPlugin",
         "extended_plugins" : [ ],
         "has_native_controller" : false
       },
       {
         "name" : "search-guard-6",
         "version" : "6.6.1-24.3",
         "elasticsearch_version" : "6.6.1",
         "java_version" : "1.8",
         "description" : "Provide access control related features for Elasticsearch 6",
         "classname" : "com.floragunn.searchguard.SearchGuardPlugin",
         "extended_plugins" : [ ],
         "has_native_controller" : false
       },
       {
         "name" : "ingest-geoip",
         "version" : "6.6.1",
         "elasticsearch_version" : "6.6.1",
         "java_version" : "1.8",
         "description" : "Ingest processor that uses looksup geo data based on ip adresses using the Maxmind geo database",
         "classname" : "org.elasticsearch.ingest.geoip.IngestGeoIpPlugin",
         "extended_plugins" : [ ],
         "has_native_controller" : false
       },
       {
         "name" : "prometheus-exporter",
         "version" : "6.6.1.0",
         "elasticsearch_version" : "6.6.1",
         "java_version" : "1.8",
         "description" : "Export Elasticsearch metrics to Prometheus",
         "classname" : "org.elasticsearch.plugin.prometheus.PrometheusExporterPlugin",
         "extended_plugins" : [ ],
         "has_native_controller" : false
       },
       {
         "name" : "ingest-attachment",
         "version" : "6.6.1",
         "elasticsearch_version" : "6.6.1",
         "java_version" : "1.8",
         "description" : "Ingest processor that uses Apache Tika to extract contents",
         "classname" : "org.elasticsearch.ingest.attachment.IngestAttachmentPlugin",
         "extended_plugins" : [ ],
         "has_native_controller" : false
       }
     ],
     "network_types" : {
       "transport_types" : {
         "com.floragunn.searchguard.ssl.http.netty.SearchGuardSSLNettyTransport" : 16
       },
       "http_types" : {
         "com.floragunn.searchguard.http.SearchGuardHttpServerTransport" : 16
       }
     }
   }
 }

Having that many indices, shards and nodes for that few documents seems very odd. Are these stats accurate?
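
To make the oddity concrete, here is the arithmetic from the stats above (a quick sanity check, not from the thread):

```python
# Quick arithmetic on the cluster stats pasted above
shards, indices, docs = 2709, 229, 9

print(round(shards / indices, 1))  # ≈ 11.8 shards per index
print(docs)                        # only 9 documents in the whole cluster
```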

Yes, this is the actual output; nothing has been modified.

It does not look like this cluster is used as there are only 9 documents in it. I would recommend deleting a lot of the empty indices you appear to have.
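
A sketch for finding candidates to delete (a hypothetical helper, not from the thread): list indices with zero documents so the empty dailies can be reviewed before removal.

```python
# Sketch (hypothetical helper): find indices with zero documents so the
# empty daily indices can be reviewed and deleted.

def find_empty_indices(rows):
    """rows: parsed JSON from GET _cat/indices?format=json&h=index,docs.count"""
    # docs.count is null while an index's primaries are unassigned; treat
    # only an explicit "0" as empty so unreadable indices are left alone.
    return [r["index"] for r in rows if r.get("docs.count") == "0"]

if __name__ == "__main__":
    sample = [  # rows shaped like the _cat/indices JSON output
        {"index": "dc-log-2021-11-30", "docs.count": "0"},
        {"index": "log-2021-12-17", "docs.count": "9"},
        {"index": "dc-log-2021-12-01", "docs.count": None},
    ]
    print(find_empty_indices(sample))  # ['dc-log-2021-11-30']
```

Against a live cluster you would fetch the rows with `GET _cat/indices?format=json&h=index,docs.count` and then `DELETE` each returned name after reviewing the list.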


I agree with @Christian_Dahlqvist: you only have 9 docs across your 229 indices, so there is plenty of scope for cleanup.

The commands below will give you more insight into the issue:

GET _cluster/health?filter_path=status,*_shards

GET _cat/shards?v=true&h=index,shard,prirep,state,node,unassigned.reason&s=state

These indices are created on a daily basis and should contain documents; however, the indices get created but no data makes it in.
Output of _cat/shards for one of the indices:

dc-log-2021-12-01  1     p      UNASSIGNED             INDEX_CREATED     
dc-log-2021-12-01  1     r      UNASSIGNED             INDEX_CREATED     
dc-log-2021-12-01  0     p      UNASSIGNED             INDEX_CREATED     
dc-log-2021-12-01  0     r      UNASSIGNED             INDEX_CREATED     

Can you share the output of:

GET /_cluster/allocation/explain

$ curl -XGET "http://elasticsearch.svc:9200/_cluster/allocation/explain?pretty"

{
  "index" : "log-2021-12-17",
  "shard" : 3,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "INDEX_CREATED",
    "at" : "2021-12-17T00:01:24.150Z",
    "last_allocation_status" : "throttled"
  },
  "can_allocate" : "yes",
  "allocate_explanation" : "can allocate the shard",
  "target_node" : {
    "id" : "1Y548FY4SoewnzXwYuIpOA",
    "name" : "data-8",
    "transport_address" : "193.158.46.176:9300"
  },
  "node_allocation_decisions" : [
    {
      "node_id" : "4gt7oqoDSJ6CXLQ-Fa2RyQ",
      "node_name" : "data-2",
      "transport_address" : "192.168.16.194:9300",
      "node_decision" : "yes",
      "weight_ranking" : 2
    },
    {
      "node_id" : "1Y548FY4SoewnzXwYuIpOA",
      "node_name" : "data-8",
      "transport_address" : "192.168.16.120:9300",
      "node_decision" : "yes",
      "weight_ranking" : 3
    },
    {
      "node_id" : "zvBWqqoIRyyxB0_uGqGsFg",
      "node_name" : "data-3",
      "transport_address" : "192.168.16.249:9300",
      "node_decision" : "yes",
      "weight_ranking" : 4
    },
    {
      "node_id" : "Flb-rCjRSwqn4gFSi36J-A",
      "node_name" : "data-7",
      "transport_address" : "192.168.16.118:9300",
      "node_decision" : "yes",
      "weight_ranking" : 5
    },
    {
      "node_id" : "JJa1HLBvQBuK0-Dek-CzyA",
      "node_name" : "data-6",
      "transport_address" : "192.168.16.136:9300",
      "node_decision" : "yes",
      "weight_ranking" : 6
    },
    {
      "node_id" : "lzMtX6vAQ82gsvXBSy8qSA",
      "node_name" : "data-0",
      "transport_address" : "192.168.16.188:9300",
      "node_decision" : "yes",
      "weight_ranking" : 7
    },
    {
      "node_id" : "JAyh0RHBT8GvRAhbjCAc0A",
      "node_name" : "data-5",
      "transport_address" : "192.168.16.138:9300",
      "node_decision" : "yes",
      "weight_ranking" : 8
    },
    {
      "node_id" : "u6JfUVSRTJC2mOjIuIKjpQ",
      "node_name" : "data-9",
      "transport_address" : "192.168.16.13:9300",
      "node_decision" : "yes",
      "weight_ranking" : 9
    },
    {
      "node_id" : "Br15wnqfQvaQfsniAu6czQ",
      "node_name" : "data-4",
      "transport_address" : "192.168.16.68:9300",
      "node_decision" : "yes",
      "weight_ranking" : 10
    },
    {
      "node_id" : "TULFVEcrTxCfINXVazprLA",
      "node_name" : "data-1",
      "transport_address" : "192.168.16.126:9300",
      "node_decision" : "throttled",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "throttling",
          "decision" : "THROTTLE",
          "explanation" : "reached the limit of ongoing initial primary recoveries [4], cluster setting [cluster.routing.allocation.node_initial_primaries_recoveries=4]"
        }
      ]
    }
  ]
}
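
A note on the output above: the allocator actually prefers data-1 (weight_ranking 1, i.e. it holds the fewest shards), but data-1 has hit the limit of 4 concurrent initial primary recoveries (`cluster.routing.allocation.node_initial_primaries_recoveries`), and since that node is also dropping cluster states those recoveries likely never finish, so new shards queue behind it indefinitely. As a stopgap while you investigate (a sketch; match the node name your cluster reports), you could temporarily exclude data-1 from allocation so shards land on the healthy nodes, then clear the setting by putting it back to null:

```
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "data-1"
  }
}
```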

Below is the _cat/thread_pool output for 2 of the nodes. I see that the active thread counts are almost all zero. Could this be the reason behind the issue? If so, please suggest how we can resolve it:

$ curl -XGET "http://elasticsearch.svc:9200/_cat/thread_pool?v=true&h=node_name,name,active,rejected,completed,queue,core,largest,max,psz,qs,sz&pretty"
node_name               name                active rejected  completed queue largest max   qs
data-3                  analyze                  0        0          0     0       0   1   16
data-3                  fetch_shard_started      0        0          0     0       0  24   -1
data-3                  fetch_shard_store        0        0       1678     0      24  24   -1
data-3                  flush                    0        0        610     0       5   5   -1
data-3                  force_merge              0        0          0     0       0   1   -1
data-3                  generic                  0        0  155969245     0      48 128   -1
data-3                  get                      0        0          0     0       0  12 1000
data-3                  index                    0        0          0     0       0  12  200
data-3                  listener                 0        0          0     0       0   6   -1
data-3                  management               1        0    9048087     0       5   5   -1
data-3                  refresh                  0        0 4192633568     0       6   6   -1
data-3                  search                   0        0       4277     0      19  19 1000
data-3                  search_throttled         0        0          0     0       0   1  100
data-3                  snapshot                 0        0          0     0       0   5   -1
data-3                  warmer                   0        0          0     0       0   5   -1
data-3                  write                    0        0      46392     0      12  12  200
data-1                  analyze                  0        0          0     0       0   1   16
data-1                  fetch_shard_started      0        0          0     0       0  24   -1
data-1                  fetch_shard_store        0        0        572     0      24  24   -1
data-1                  flush                    0        0          0     0       0   5   -1
data-1                  force_merge              0        0          0     0       0   1   -1
data-1                  generic                  1        0    8300508     0       4 128   -1
data-1                  get                      0        0          0     0       0  12 1000
data-1                  index                    0        0          0     0       0  12  200
data-1                  listener                 0        0          0     0       0   6   -1
data-1                  management               3        0    4129068     0       5   5   -1
data-1                  refresh                  0        0          0     0       0   6   -1
data-1                  search                   0        0          0     0       0  19 1000
data-1                  search_throttled         0        0          0     0       0   1  100
data-1                  snapshot                 0        0          0     0       0   5   -1
data-1                  warmer                   0        0          0     0       0   5   -1
data-1                  write                    0        0          0     0       0  12  200
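
Idle pools are not in themselves a problem; what stands out is that data-1 shows zero completed flush, refresh and write tasks while several management threads are active. If it helps, the hot threads API (a standard endpoint in 6.x) can show what those busy management/generic threads on data-1 are actually doing:

```
GET _nodes/data-1/hot_threads?threads=5
```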

What is the hardware specification of the nodes? What type of storage are you using?
