'timed out waiting for all nodes to process published state' and cluster unavailability

Hi all,

I am setting up a three-node Elasticsearch cluster with Docker. This is my docker-compose file:

version: '2.0'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch-oss:6.3.0
    environment:
      - cluster.name=test-cluster
      - node.name=elastic_1
      - ES_JAVA_OPTS=-Xms512m -Xmx512m
      - bootstrap.memory_lock=true
      - discovery.zen.minimum_master_nodes=2
      - discovery.zen.ping.unicast.hosts=elasticsearch,elasticsearch2,elasticsearch3
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - test_es_cluster_data:/usr/share/elasticsearch/data
    networks:
      - esnet
  elasticsearch2:
    extends:
      file: ./docker-compose.yml
      service: elasticsearch
    environment:
      - node.name=elastic_2
    volumes:
      - test_es_cluster2_data:/usr/share/elasticsearch/data
  elasticsearch3:
    extends:
      file: ./docker-compose.yml
      service: elasticsearch
    environment:
      - node.name=elastic_3
    volumes:
      - test_es_cluster3_data:/usr/share/elasticsearch/data
volumes:
  test_es_cluster_data:
  test_es_cluster2_data:
  test_es_cluster3_data:
networks:
  esnet:
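
For completeness, the stack is brought up in the usual compose way; this is only a sketch, assuming the file lives in a directory named escluster (which is what gives the escluster_* container names used below):

docker-compose up -d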

Once the cluster is up, I stop the master (elastic_1) to test failover. I expect a new master to be elected while the cluster keeps responding to read requests the whole time.
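
The test itself boils down to something like the following. This is only a sketch: the container names assume the escluster compose project from above, and because the compose file does not publish port 9200 to the host, the health check goes through docker exec (assuming curl is available inside the container):

docker stop escluster_elasticsearch_1                    # stop the current master
docker exec escluster_elasticsearch2_1 \
  curl -s 'localhost:9200/_cat/health?v'                 # poll cluster health from a surviving node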

Well, a new master is elected, but the cluster does not respond for quite a long time (~45 s).

Below are the logs from elastic_2 and elastic_3 after the master is stopped (docker stop escluster_elasticsearch_1):

elastic_2:

...
[2018-07-04T14:47:04,495][INFO ][o.e.d.z.ZenDiscovery     ] [elastic_2] master_left [{elastic_1}{...}{172.24.0.3}{172.24.0.3:9300}], reason [shut_down]
...
[2018-07-04T14:47:04,509][WARN ][o.e.c.NodeConnectionsService] [elastic_2] failed to connect to node {elastic_1}{...}{172.24.0.3}{172.24.0.3:9300} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [elastic_1][172.24.0.3:9300] connect_exception
	...
[2018-07-04T14:47:07,565][INFO ][o.e.c.s.ClusterApplierService] [elastic_2] detected_master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300}, reason: apply cluster state (from master [master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300} committed version [4]])
[2018-07-04T14:47:35,301][WARN ][r.suppressed             ] path: /_cat/health, params: {}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
	...
[2018-07-04T14:47:53,933][WARN ][o.e.c.s.ClusterApplierService] [elastic_2] cluster state applier task [apply cluster state (from master [master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300} committed version [4]])] took [46.3s] above the warn threshold of 30s
[2018-07-04T14:47:53,934][INFO ][o.e.c.s.ClusterApplierService] [elastic_2] removed {{elastic_1}{...}{172.24.0.3}{172.24.0.3:9300},}, reason: apply cluster state (from master [master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300} committed version [5]])
[2018-07-04T14:47:56,931][WARN ][o.e.t.TransportService   ] [elastic_2] Received response for a request that has timed out, sent [48367ms] ago, timed out [18366ms] ago, action [internal:discovery/zen/fd/master_ping], node [{elastic_3}{...}{172.24.0.4}{172.24.0.4:9300}], id [1035]

elastic_3:

[2018-07-04T14:47:04,494][INFO ][o.e.d.z.ZenDiscovery     ] [elastic_3] master_left [{elastic_1}{...}{172.24.0.3}{172.24.0.3:9300}], reason [shut_down]
...
[2018-07-04T14:47:04,519][WARN ][o.e.c.NodeConnectionsService] [elastic_3] failed to connect to node {elastic_1}{...}{172.24.0.3}{172.24.0.3:9300} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [elastic_1][172.24.0.3:9300] connect_exception
	...
[2018-07-04T14:47:07,550][INFO ][o.e.c.s.MasterService    ] [elastic_3] zen-disco-elected-as-master ([1] nodes joined)[, ], reason: new_master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300}
[2018-07-04T14:47:35,026][WARN ][r.suppressed             ] path: /_cat/nodes, params: {v=}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
	...
[2018-07-04T14:47:37,560][WARN ][o.e.d.z.PublishClusterStateAction] [elastic_3] timed out waiting for all nodes to process published state [4] (timeout [30s], pending nodes: [{elastic_2}{...}{172.24.0.2}{172.24.0.2:9300}])
[2018-07-04T14:47:37,561][INFO ][o.e.c.s.ClusterApplierService] [elastic_3] new_master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300}, reason: apply cluster state (from master [master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300} committed version [4] source [zen-disco-elected-as-master ([1] nodes joined)[, ]]])
[2018-07-04T14:47:41,021][WARN ][o.e.c.s.MasterService    ] [elastic_3] cluster state update task [zen-disco-elected-as-master ([1] nodes joined)[, ]] took [33.4s] above the warn threshold of 30s
[2018-07-04T14:47:41,022][INFO ][o.e.c.s.MasterService    ] [elastic_3] zen-disco-node-failed({elastic_1}{...}{172.24.0.3}{172.24.0.3:9300}), reason(transport disconnected), reason: removed {{elastic_1}{...}{172.24.0.3}{172.24.0.3:9300},}
[2018-07-04T14:47:56,929][INFO ][o.e.c.s.ClusterApplierService] [elastic_3] removed {{elastic_1}{...}{172.24.0.3}{172.24.0.3:9300},}, reason: apply cluster state (from master [master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300} committed version [5] source [zen-disco-node-failed({elastic_1}{...}{172.24.0.3}{172.24.0.3:9300}), reason(transport disconnected)]])

Why does it take so long for the cluster to stabilize and respond to requests?

What puzzles me is the sequence of events:

a) a new master (elastic_3) is elected:

[2018-07-04T14:47:07,550][INFO ] ... [elastic_3] zen-disco-elected-as-master ([1] nodes joined)[, ], reason: new_master {elastic_3}...

b) then it is detected by elastic_2:

[2018-07-04T14:47:07,565][INFO ] ... [elastic_2] detected_master {elastic_3}...

c) then, the master times out waiting for the published state to be processed:

[2018-07-04T14:47:37,560][WARN ] ... [elastic_3] timed out waiting for all nodes to process published state [4] (timeout [30s], pending nodes: [{elastic_2}...])

d) finally, elastic_2 applies the cluster state with a warning:

[2018-07-04T14:47:53,933][WARN ] ... [elastic_2] cluster state applier task [apply cluster state (from master [master {elastic_3}...])] took [46.3s] above the warn threshold of 30s

What can cause the timeout in (c)? All of this runs on a local machine, so there are no network issues. Am I missing any configuration?

Meanwhile, requests to both elastic_2 and elastic_3 end up with MasterNotDiscoveredException. According to the documentation, the cluster is expected to keep serving reads (https://www.elastic.co/guide/en/elasticsearch/reference/6.3/modules-discovery-zen.html#no-master-block).
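
For reference, the behaviour described on that page is controlled by discovery.zen.no_master_block, which defaults to write, i.e. only write operations should be rejected while there is no elected master. A minimal sketch of setting it explicitly via the compose environment (the value shown is just the documented default, not a fix):

  elasticsearch:
    environment:
      # block only writes while no master is elected; reads are expected to keep working
      - discovery.zen.no_master_block=write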

Has anyone experienced this? I would appreciate any advice on this issue.

Regards,
mim

docker stop is too aggressive here - it tears down the node and its network interface before a new master has been elected. The newly elected master then tries to contact the old one, but because the old master has completely vanished, the connection attempt gets no response until it times out.

This is much more common within Docker than without: outside Docker the old master actively and immediately rejects the new master's connection attempt because its network interface is typically still alive.
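
For a test setup like this, one could try lowering the transport connect timeout, which defaults to 30s in 6.x, so that connection attempts to the vanished node fail sooner. This is only a sketch under the assumption that transport.tcp.connect_timeout is the timeout actually being hit here, not something the explanation above prescribes:

  elasticsearch:
    environment:
      # assumption: give up on connecting to the vanished master sooner than the 30s default
      - transport.tcp.connect_timeout=5s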

Thank you, @DavidTurner
When using docker restart instead of docker stop, re-election works fine.
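
A minimal sketch of the restart-based variant of the test (same assumed container names as before):

docker restart escluster_elasticsearch_1                 # restart the master instead of stopping it
docker exec escluster_elasticsearch2_1 \
  curl -s 'localhost:9200/_cat/master?v'                 # confirm which node is master afterwards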
