'timed out waiting for all nodes to process published state' and cluster unavailability

Hi all,

I am setting up a three-node Elasticsearch cluster with Docker. This is my docker-compose file:

version: '2.0'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch-oss:6.3.0
    environment:
      - cluster.name=test-cluster
      - node.name=elastic_1
      - ES_JAVA_OPTS=-Xms512m -Xmx512m
      - bootstrap.memory_lock=true
      - discovery.zen.minimum_master_nodes=2
      - discovery.zen.ping.unicast.hosts=elasticsearch,elasticsearch2,elasticsearch3
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - test_es_cluster_data:/usr/share/elasticsearch/data
    networks:
      - esnet
  elasticsearch2:
    extends:
      file: ./docker-compose.yml
      service: elasticsearch
    environment:
      - node.name=elastic_2
    volumes:
      - test_es_cluster2_data:/usr/share/elasticsearch/data
  elasticsearch3:
    extends:
      file: ./docker-compose.yml
      service: elasticsearch
    environment:
      - node.name=elastic_3
    volumes:
      - test_es_cluster3_data:/usr/share/elasticsearch/data
volumes:
  test_es_cluster_data:
  test_es_cluster2_data:
  test_es_cluster3_data:
networks:
  esnet:
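
For completeness, the stack is brought up in the usual compose way; this is only a sketch, assuming the file lives in a directory named escluster (which is what gives the escluster_* container names used below):

docker-compose up -d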

Once the cluster is up, I stop the master (elastic_1) to test failover. I expect a new master to be elected while the cluster keeps responding to read requests the whole time.
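
The test itself boils down to something like the following. This is only a sketch: the container names assume the escluster compose project from above, and because the compose file does not publish port 9200 to the host, the health check goes through docker exec (assuming curl is available inside the container):

docker stop escluster_elasticsearch_1                    # stop the current master
docker exec escluster_elasticsearch2_1 \
  curl -s 'localhost:9200/_cat/health?v'                 # poll cluster health from a surviving node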

Well, a new master is elected, but the cluster does not respond for quite a long time (~45 s).

Below are the logs from elastic_2 and elastic_3 after the master is stopped (docker stop escluster_elasticsearch_1):

elastic_2:

...
[2018-07-04T14:47:04,495][INFO ][o.e.d.z.ZenDiscovery     ] [elastic_2] master_left [{elastic_1}{...}{172.24.0.3}{172.24.0.3:9300}], reason [shut_down]
...
[2018-07-04T14:47:04,509][WARN ][o.e.c.NodeConnectionsService] [elastic_2] failed to connect to node {elastic_1}{...}{172.24.0.3}{172.24.0.3:9300} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [elastic_1][172.24.0.3:9300] connect_exception
	...
[2018-07-04T14:47:07,565][INFO ][o.e.c.s.ClusterApplierService] [elastic_2] detected_master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300}, reason: apply cluster state (from master [master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300} committed version [4]])
[2018-07-04T14:47:35,301][WARN ][r.suppressed             ] path: /_cat/health, params: {}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
	...
[2018-07-04T14:47:53,933][WARN ][o.e.c.s.ClusterApplierService] [elastic_2] cluster state applier task [apply cluster state (from master [master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300} committed version [4]])] took [46.3s] above the warn threshold of 30s
[2018-07-04T14:47:53,934][INFO ][o.e.c.s.ClusterApplierService] [elastic_2] removed {{elastic_1}{...}{172.24.0.3}{172.24.0.3:9300},}, reason: apply cluster state (from master [master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300} committed version [5]])
[2018-07-04T14:47:56,931][WARN ][o.e.t.TransportService   ] [elastic_2] Received response for a request that has timed out, sent [48367ms] ago, timed out [18366ms] ago, action [internal:discovery/zen/fd/master_ping], node [{elastic_3}{...}{172.24.0.4}{172.24.0.4:9300}], id [1035]

elastic_3:

[2018-07-04T14:47:04,494][INFO ][o.e.d.z.ZenDiscovery     ] [elastic_3] master_left [{elastic_1}{...}{172.24.0.3}{172.24.0.3:9300}], reason [shut_down]
...
[2018-07-04T14:47:04,519][WARN ][o.e.c.NodeConnectionsService] [elastic_3] failed to connect to node {elastic_1}{...}{172.24.0.3}{172.24.0.3:9300} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [elastic_1][172.24.0.3:9300] connect_exception
	...
[2018-07-04T14:47:07,550][INFO ][o.e.c.s.MasterService    ] [elastic_3] zen-disco-elected-as-master ([1] nodes joined)[, ], reason: new_master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300}
[2018-07-04T14:47:35,026][WARN ][r.suppressed             ] path: /_cat/nodes, params: {v=}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
	...
[2018-07-04T14:47:37,560][WARN ][o.e.d.z.PublishClusterStateAction] [elastic_3] timed out waiting for all nodes to process published state [4] (timeout [30s], pending nodes: [{elastic_2}{...}{172.24.0.2}{172.24.0.2:9300}])
[2018-07-04T14:47:37,561][INFO ][o.e.c.s.ClusterApplierService] [elastic_3] new_master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300}, reason: apply cluster state (from master [master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300} committed version [4] source [zen-disco-elected-as-master ([1] nodes joined)[, ]]])
[2018-07-04T14:47:41,021][WARN ][o.e.c.s.MasterService    ] [elastic_3] cluster state update task [zen-disco-elected-as-master ([1] nodes joined)[, ]] took [33.4s] above the warn threshold of 30s
[2018-07-04T14:47:41,022][INFO ][o.e.c.s.MasterService    ] [elastic_3] zen-disco-node-failed({elastic_1}{...}{172.24.0.3}{172.24.0.3:9300}), reason(transport disconnected), reason: removed {{elastic_1}{...}{172.24.0.3}{172.24.0.3:9300},}
[2018-07-04T14:47:56,929][INFO ][o.e.c.s.ClusterApplierService] [elastic_3] removed {{elastic_1}{...}{172.24.0.3}{172.24.0.3:9300},}, reason: apply cluster state (from master [master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300} committed version [5] source [zen-disco-node-failed({elastic_1}{...}{172.24.0.3}{172.24.0.3:9300}), reason(transport disconnected)]])

Why does it take so long for the cluster to stabilize and respond to requests?

What puzzles me is the sequence of events:

a) a new master (elastic_3) is elected:

[2018-07-04T14:47:07,550][INFO ] ... [elastic_3] zen-disco-elected-as-master ([1] nodes joined)[, ], reason: new_master {elastic_3}...

b) then it is detected by elastic_2:

[2018-07-04T14:47:07,565][INFO ] ... [elastic_2] detected_master {elastic_3}...

c) then, the master times out waiting for the published state to be processed:

[2018-07-04T14:47:37,560][WARN ] ... [elastic_3] timed out waiting for all nodes to process published state [4] (timeout [30s], pending nodes: [{elastic_2}...])

d) finally, elastic_2 applies the cluster state with a warning:

[2018-07-04T14:47:53,933][WARN ] ... [elastic_2] cluster state applier task [apply cluster state (from master [master {elastic_3}...])] took [46.3s] above the warn threshold of 30s

What can cause the timeout in (c)? All of this runs on a local machine, so there are no network issues. Am I missing any configuration?

Meanwhile, requests to both elastic_2 and elastic_3 end up with MasterNotDiscoveredException. According to the documentation, the cluster is expected to keep serving reads (https://www.elastic.co/guide/en/elasticsearch/reference/6.3/modules-discovery-zen.html#no-master-block).
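
For reference, the behaviour described on that page is controlled by discovery.zen.no_master_block, which defaults to write, i.e. only write operations should be rejected while there is no elected master. A minimal sketch of setting it explicitly via the compose environment (the value shown is just the documented default, not a fix):

  elasticsearch:
    environment:
      # block only writes while no master is elected; reads are expected to keep working
      - discovery.zen.no_master_block=write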

Has anyone experienced this? I would appreciate any advice on this issue.

Regards,
mim

docker stop is too aggressive here - it tears down the node and its network interface before a new master has been elected. The newly elected master then tries to contact the old one, but because the old master has completely vanished, the connection attempt gets no response until it times out.

This is much more common within Docker than without: outside Docker the old master actively and immediately rejects the new master's connection attempt because its network interface is typically still alive.
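
For a test setup like this, one could try lowering the transport connect timeout, which defaults to 30s in 6.x, so that connection attempts to the vanished node fail sooner. This is only a sketch under the assumption that transport.tcp.connect_timeout is the timeout actually being hit here, not something the explanation above prescribes:

  elasticsearch:
    environment:
      # assumption: give up on connecting to the vanished master sooner than the 30s default
      - transport.tcp.connect_timeout=5s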

Thank you, @DavidTurner
When using docker restart instead of docker stop, re-election works fine.
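
A minimal sketch of the restart-based variant of the test (same assumed container names as before):

docker restart escluster_elasticsearch_1                 # restart the master instead of stopping it
docker exec escluster_elasticsearch2_1 \
  curl -s 'localhost:9200/_cat/master?v'                 # confirm which node is master afterwards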
