Receiving "timed out waiting for all nodes to process published state"

I am using Elasticsearch v6.6.1. There are 16 nodes in total (10 data, 3 master, 3 client).
I frequently see this message in the elected master's logs:

{"type":"log","host":"Elasticsearch-master-55ff74b94d-ckhfw","level":"WARN", "time": "2021-10-13T00:02:51.273Z", "logger":"o.e.d.z.PublishClusterStateAction","
timezone":"UTC","marker":"[Elasticsearch-master-55ff74b94d-ckhfw] ","log":"timed out waiting for all nodes to process published state [175229398] (timeout [30s], pending nodes: [{Elasticsearch-data-1}{TULFVEcrTxCfINXVazprLA}{Yd1lhNcUQxaKFW8zBhIPjQ}{IP:9300}])"}

And there are frequent messages like this on the data-1 node:
{"type":"log","host":"Elasticsearch-data-1","level":"WARN","time": "2021-09-18T15:03:32.407Z","logger":"o.e.d.z.ZenDiscovery","timezone":"UTC","marker":"[Elasticsearch-data-1] ","log":"dropping pending state [[uuid[p2MPdq59QCKtduezFv9Y2A], v[175229293], m[NAlqPclnQY-25G9p6_4mBA]]]. more than [25] pending states."}

The main issue is that no shards get assigned to the data-1 node once these logs start appearing. Only data-1 is affected; the other nodes are working fine.
Please suggest a way to overcome this.
Thanks.

v6.6.1 was released almost 3 years ago and is long past EOL. This whole area was rewritten in 7.0. You should upgrade.
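
For context on the data-1 log line: in 6.x Zen discovery, a node queues cluster-state updates it has not yet applied, and once the queue grows beyond its limit (25 by default) the node starts dropping pending states, which is why the master then times out waiting for that node. If I recall the 6.x settings correctly, the limit is the static node setting below; raising it only buys headroom and does not fix a node that cannot keep up with cluster state updates:

```yaml
# elasticsearch.yml on the affected node (static setting, needs a restart).
# Default is 25, matching the "more than [25] pending states" message.
discovery.zen.publish.max_pending_cluster_states: 50
```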

Okay, we will try to upgrade. In the meantime, any insight into these logs and the underlying issue would be a great help.

What is the full output of the cluster stats API?

The output of the cluster stats API is:

{
   "_nodes" : {
     "total" : 16,
     "successful" : 16,
     "failed" : 0
   },
   "cluster_name" : "elastic-dr-elastic",
   "cluster_uuid" : "btgycnVpQdKbfefT8TOBg",
   "timestamp" : 1639718609354,
   "status" : "yellow",
   "indices" : {
     "count" : 229,
     "shards" : {
       "total" : 2709,
       "primaries" : 1351,
       "replication" : 1.005181347150259,
       "index" : {
         "shards" : {
           "min" : 2,
           "max" : 12,
           "avg" : 11.829694323144105
         },
         "primaries" : {
           "min" : 1,
           "max" : 6,
           "avg" : 5.899563318777292
         },
         "replication" : {
           "min" : 1.0,
           "max" : 8.0,
           "avg" : 1.0305676855895196
         }
       }
     },
     "docs" : {
       "count" : 9,
       "deleted" : 0
     },
     "store" : {
       "size_in_bytes" : 998468
     },
     "fielddata" : {
       "memory_size_in_bytes" : 0,
       "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 0,
      "total_count" : 0,
      "hit_count" : 0,
      "miss_count" : 0,
      "cache_size" : 0,
      "cache_count" : 0,
      "evictions" : 0
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 37,
      "memory_in_bytes" : 65075,
      "terms_memory_in_bytes" : 41354,
      "stored_fields_memory_in_bytes" : 15688,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 4288,
      "points_memory_in_bytes" : 45,
      "doc_values_memory_in_bytes" : 3700,
      "index_writer_memory_in_bytes" : 0,
      "version_map_memory_in_bytes" : 0,
      "fixed_bit_set_memory_in_bytes" : 0,
      "max_unsafe_auto_id_timestamp" : -1,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 16,
      "data" : 10,
      "coordinating_only" : 0,
      "master" : 3,
      "ingest" : 16
    },
    "versions" : [
      "6.6.1"
    ],
     "os" : {
       "available_processors" : 144,
       "allocated_processors" : 144,
       "names" : [
         {
           "name" : "Linux",
           "count" : 16
         }
       ],
       "pretty_names" : [
         {
           "pretty_name" : "CentOS Linux 7 (Core)",
           "count" : 16
         }
       ],
       "mem" : {
         "total_in_bytes" : 1619268435968,
         "free_in_bytes" : 714996961280,
         "used_in_bytes" : 904271474688,
         "free_percent" : 44,
         "used_percent" : 56
       }
     },
     "process" : {
       "cpu" : {
         "percent" : 0
       },
       "open_file_descriptors" : {
         "min" : 602,
         "max" : 1314,
         "avg" : 1009
       }
     },
     "jvm" : {
       "max_uptime_in_millis" : 36969780618,
       "versions" : [
         {
           "version" : "1.8.0_212",
           "vm_name" : "OpenJDK 64-Bit Server VM",
           "vm_version" : "25.212-b04",
           "vm_vendor" : "Oracle Corporation",
           "count" : 16
         }
       ],
       "mem" : {
         "heap_used_in_bytes" : 61973215288,
         "heap_max_in_bytes" : 413382868992
       },
       "threads" : 1535
     },
     "fs" : {
       "total_in_bytes" : 119748558626816,
       "free_in_bytes" : 119656676990976,
       "available_in_bytes" : 119615178137600
     },
     "plugins" : [
       {
         "name" : "ingest-user-agent",
         "version" : "6.6.1",
         "elasticsearch_version" : "6.6.1",
         "java_version" : "1.8",
         "description" : "Ingest processor that extracts information from a user agent",
         "classname" : "org.elasticsearch.ingest.useragent.IngestUserAgentPlugin",
         "extended_plugins" : [ ],
         "has_native_controller" : false
       },
       {
         "name" : "search-guard-6",
         "version" : "6.6.1-24.3",
         "elasticsearch_version" : "6.6.1",
         "java_version" : "1.8",
         "description" : "Provide access control related features for Elasticsearch 6",
         "classname" : "com.floragunn.searchguard.SearchGuardPlugin",
         "extended_plugins" : [ ],
         "has_native_controller" : false
       },
       {
         "name" : "ingest-geoip",
         "version" : "6.6.1",
         "elasticsearch_version" : "6.6.1",
         "java_version" : "1.8",
         "description" : "Ingest processor that uses looksup geo data based on ip adresses using the Maxmind geo database",
         "classname" : "org.elasticsearch.ingest.geoip.IngestGeoIpPlugin",
         "extended_plugins" : [ ],
         "has_native_controller" : false
       },
       {
         "name" : "prometheus-exporter",
         "version" : "6.6.1.0",
         "elasticsearch_version" : "6.6.1",
         "java_version" : "1.8",
         "description" : "Export Elasticsearch metrics to Prometheus",
         "classname" : "org.elasticsearch.plugin.prometheus.PrometheusExporterPlugin",
         "extended_plugins" : [ ],
         "has_native_controller" : false
       },
       {
         "name" : "ingest-attachment",
         "version" : "6.6.1",
         "elasticsearch_version" : "6.6.1",
         "java_version" : "1.8",
         "description" : "Ingest processor that uses Apache Tika to extract contents",
         "classname" : "org.elasticsearch.ingest.attachment.IngestAttachmentPlugin",
         "extended_plugins" : [ ],
         "has_native_controller" : false
       }
     ],
     "network_types" : {
       "transport_types" : {
         "com.floragunn.searchguard.ssl.http.netty.SearchGuardSSLNettyTransport" : 16
       },
       "http_types" : {
         "com.floragunn.searchguard.http.SearchGuardHttpServerTransport" : 16
       }
     }
   }
 }

Having that many indices, shards and nodes for that few documents seems very odd. Are these stats accurate?
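
To make the oddity concrete, here is the arithmetic from the stats above (a quick sanity check, not from the thread):

```python
# Quick arithmetic on the cluster stats pasted above
shards, indices, docs = 2709, 229, 9

print(round(shards / indices, 1))  # ≈ 11.8 shards per index
print(docs)                        # only 9 documents in the whole cluster
```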

Yes, this is the actual output; nothing has been modified.

It does not look like this cluster is used as there are only 9 documents in it. I would recommend deleting a lot of the empty indices you appear to have.
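
A sketch for finding candidates to delete (a hypothetical helper, not from the thread): list indices with zero documents so the empty dailies can be reviewed before removal.

```python
# Sketch (hypothetical helper): find indices with zero documents so the
# empty daily indices can be reviewed and deleted.

def find_empty_indices(rows):
    """rows: parsed JSON from GET _cat/indices?format=json&h=index,docs.count"""
    # docs.count is null while an index's primaries are unassigned; treat
    # only an explicit "0" as empty so unreadable indices are left alone.
    return [r["index"] for r in rows if r.get("docs.count") == "0"]

if __name__ == "__main__":
    sample = [  # rows shaped like the _cat/indices JSON output
        {"index": "dc-log-2021-11-30", "docs.count": "0"},
        {"index": "log-2021-12-17", "docs.count": "9"},
        {"index": "dc-log-2021-12-01", "docs.count": None},
    ]
    print(find_empty_indices(sample))  # ['dc-log-2021-11-30']
```

Against a live cluster you would fetch the rows with `GET _cat/indices?format=json&h=index,docs.count` and then `DELETE` each returned name after reviewing the list.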


I agree with @Christian_Dahlqvist: you only have 9 docs across your 229 indices, so there is plenty of scope for cleanup.

The commands below will give you more insight into the issue:

GET _cluster/health?filter_path=status,*_shards

GET _cat/shards?v=true&h=index,shard,prirep,state,node,unassigned.reason&s=state

These indices are created on a daily basis and should contain documents; however, the indices get created but no data makes it in.
Output of _cat/shards for one of the indices:

dc-log-2021-12-01  1     p      UNASSIGNED             INDEX_CREATED     
dc-log-2021-12-01  1     r      UNASSIGNED             INDEX_CREATED     
dc-log-2021-12-01  0     p      UNASSIGNED             INDEX_CREATED     
dc-log-2021-12-01  0     r      UNASSIGNED             INDEX_CREATED     

Can you share the output of:

GET /_cluster/allocation/explain

$ curl -XGET "http://elasticsearch.svc:9200/_cluster/allocation/explain?pretty"

{
  "index" : "log-2021-12-17",
  "shard" : 3,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "INDEX_CREATED",
    "at" : "2021-12-17T00:01:24.150Z",
    "last_allocation_status" : "throttled"
  },
  "can_allocate" : "yes",
  "allocate_explanation" : "can allocate the shard",
  "target_node" : {
    "id" : "1Y548FY4SoewnzXwYuIpOA",
    "name" : "data-8",
    "transport_address" : "193.158.46.176:9300"
  },
  "node_allocation_decisions" : [
    {
      "node_id" : "4gt7oqoDSJ6CXLQ-Fa2RyQ",
      "node_name" : "data-2",
      "transport_address" : "192.168.16.194:9300",
      "node_decision" : "yes",
      "weight_ranking" : 2
    },
    {
      "node_id" : "1Y548FY4SoewnzXwYuIpOA",
      "node_name" : "data-8",
      "transport_address" : "192.168.16.120:9300",
      "node_decision" : "yes",
      "weight_ranking" : 3
    },
    {
      "node_id" : "zvBWqqoIRyyxB0_uGqGsFg",
      "node_name" : "data-3",
      "transport_address" : "192.168.16.249:9300",
      "node_decision" : "yes",
      "weight_ranking" : 4
    },
    {
      "node_id" : "Flb-rCjRSwqn4gFSi36J-A",
      "node_name" : "data-7",
      "transport_address" : "192.168.16.118:9300",
      "node_decision" : "yes",
      "weight_ranking" : 5
    },
    {
      "node_id" : "JJa1HLBvQBuK0-Dek-CzyA",
      "node_name" : "data-6",
      "transport_address" : "192.168.16.136:9300",
      "node_decision" : "yes",
      "weight_ranking" : 6
    },
    {
      "node_id" : "lzMtX6vAQ82gsvXBSy8qSA",
      "node_name" : "data-0",
      "transport_address" : "192.168.16.188:9300",
      "node_decision" : "yes",
      "weight_ranking" : 7
    },
    {
      "node_id" : "JAyh0RHBT8GvRAhbjCAc0A",
      "node_name" : "data-5",
      "transport_address" : "192.168.16.138:9300",
      "node_decision" : "yes",
      "weight_ranking" : 8
    },
    {
      "node_id" : "u6JfUVSRTJC2mOjIuIKjpQ",
      "node_name" : "data-9",
      "transport_address" : "192.168.16.13:9300",
      "node_decision" : "yes",
      "weight_ranking" : 9
    },
    {
      "node_id" : "Br15wnqfQvaQfsniAu6czQ",
      "node_name" : "data-4",
      "transport_address" : "192.168.16.68:9300",
      "node_decision" : "yes",
      "weight_ranking" : 10
    },
    {
      "node_id" : "TULFVEcrTxCfINXVazprLA",
      "node_name" : "data-1",
      "transport_address" : "192.168.16.126:9300",
      "node_decision" : "throttled",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "throttling",
          "decision" : "THROTTLE",
          "explanation" : "reached the limit of ongoing initial primary recoveries [4], cluster setting [cluster.routing.allocation.node_initial_primaries_recoveries=4]"
        }
      ]
    }
  ]
}
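
A note on the output above: the allocator actually prefers data-1 (weight_ranking 1, i.e. it holds the fewest shards), but data-1 has hit the limit of 4 concurrent initial primary recoveries (`cluster.routing.allocation.node_initial_primaries_recoveries`), and since that node is also dropping cluster states those recoveries likely never finish, so new shards queue behind it indefinitely. As a stopgap while you investigate (a sketch; match the node name your cluster reports), you could temporarily exclude data-1 from allocation so shards land on the healthy nodes, then clear the setting by putting it back to null:

```
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "data-1"
  }
}
```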

Below is the _cat/thread_pool output for 2 of the nodes. I see that the active thread counts are almost all zero. Could this be the reason behind the issue? If so, please suggest how we can resolve it:

$ curl -XGET "http://elasticsearch.svc:9200/_cat/thread_pool?v=true&h=node_name,name,active,rejected,completed,queue,core,largest,max,psz,qs,sz&pretty"
node_name               name                active rejected  completed queue largest max   qs
data-3                  analyze                  0        0          0     0       0   1   16
data-3                  fetch_shard_started      0        0          0     0       0  24   -1
data-3                  fetch_shard_store        0        0       1678     0      24  24   -1
data-3                  flush                    0        0        610     0       5   5   -1
data-3                  force_merge              0        0          0     0       0   1   -1
data-3                  generic                  0        0  155969245     0      48 128   -1
data-3                  get                      0        0          0     0       0  12 1000
data-3                  index                    0        0          0     0       0  12  200
data-3                  listener                 0        0          0     0       0   6   -1
data-3                  management               1        0    9048087     0       5   5   -1
data-3                  refresh                  0        0 4192633568     0       6   6   -1
data-3                  search                   0        0       4277     0      19  19 1000
data-3                  search_throttled         0        0          0     0       0   1  100
data-3                  snapshot                 0        0          0     0       0   5   -1
data-3                  warmer                   0        0          0     0       0   5   -1
data-3                  write                    0        0      46392     0      12  12  200
data-1                  analyze                  0        0          0     0       0   1   16
data-1                  fetch_shard_started      0        0          0     0       0  24   -1
data-1                  fetch_shard_store        0        0        572     0      24  24   -1
data-1                  flush                    0        0          0     0       0   5   -1
data-1                  force_merge              0        0          0     0       0   1   -1
data-1                  generic                  1        0    8300508     0       4 128   -1
data-1                  get                      0        0          0     0       0  12 1000
data-1                  index                    0        0          0     0       0  12  200
data-1                  listener                 0        0          0     0       0   6   -1
data-1                  management               3        0    4129068     0       5   5   -1
data-1                  refresh                  0        0          0     0       0   6   -1
data-1                  search                   0        0          0     0       0  19 1000
data-1                  search_throttled         0        0          0     0       0   1  100
data-1                  snapshot                 0        0          0     0       0   5   -1
data-1                  warmer                   0        0          0     0       0   5   -1
data-1                  write                    0        0          0     0       0  12  200
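
Idle pools are not in themselves a problem; what stands out is that data-1 shows zero completed flush, refresh and write tasks while several management threads are active. If it helps, the hot threads API (a standard endpoint in 6.x) can show what those busy management/generic threads on data-1 are actually doing:

```
GET _nodes/data-1/hot_threads?threads=5
```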

What is the hardware specification of the nodes? What type of storage are you using?
