Hi all,
I am setting up Elasticsearch 3-nodes cluster with docker. This is my docker compose file:
version: '2.0'
services:
elasticsearch:
image: docker.elastic.co/elasticsearch/elasticsearch-oss:6.3.0
environment:
- cluster.name=test-cluster
- node.name=elastic_1
- ES_JAVA_OPTS=-Xms512m -Xmx512m
- bootstrap.memory_lock=true
- discovery.zen.minimum_master_nodes=2
- discovery.zen.ping.unicast.hosts=elasticsearch,elasticsearch2,elasticsearch3
ulimits:
memlock:
soft: -1
hard: -1
volumes:
- test_es_cluster_data:/usr/share/elasticsearch/data
networks:
- esnet
elasticsearch2:
extends:
file: ./docker-compose.yml
service: elasticsearch
environment:
- node.name=elastic_2
volumes:
- test_es_cluster2_data:/usr/share/elasticsearch/data
elasticsearch3:
extends:
file: ./docker-compose.yml
service: elasticsearch
environment:
- node.name=elastic_3
volumes:
- test_es_cluster3_data:/usr/share/elasticsearch/data
volumes:
test_es_cluster_data:
test_es_cluster2_data:
test_es_cluster3_data:
networks:
esnet:
Once the cluster is up, I kill master (elastic_1) to test failover. I expect new master will be elected, while the cluster should respond to read requests all the time.
Well, master is elected, but the cluster is not responding for pretty long time (~45s).
Please find logs from elastic_2 and elastic_3 after master is stopped (docker stop escluster_elasticsearch_1):
elastic_2:
...
[2018-07-04T14:47:04,495][INFO ][o.e.d.z.ZenDiscovery ] [elastic_2] master_left [{elastic_1}{...}{172.24.0.3}{172.24.0.3:9300}], reason [shut_down]
...
[2018-07-04T14:47:04,509][WARN ][o.e.c.NodeConnectionsService] [elastic_2] failed to connect to node {elastic_1}{...}{172.24.0.3}{172.24.0.3:9300} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [elastic_1][172.24.0.3:9300] connect_exception
...
[2018-07-04T14:47:07,565][INFO ][o.e.c.s.ClusterApplierService] [elastic_2] detected_master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300}, reason: apply cluster state (from master [master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300} committed version [4]])
[2018-07-04T14:47:35,301][WARN ][r.suppressed ] path: /_cat/health, params: {}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
...
[2018-07-04T14:47:53,933][WARN ][o.e.c.s.ClusterApplierService] [elastic_2] cluster state applier task [apply cluster state (from master [master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300} committed version [4]])] took [46.3s] above the warn threshold of 30s
[2018-07-04T14:47:53,934][INFO ][o.e.c.s.ClusterApplierService] [elastic_2] removed {{elastic_1}{...}{172.24.0.3}{172.24.0.3:9300},}, reason: apply cluster state (from master [master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300} committed version [5]])
[2018-07-04T14:47:56,931][WARN ][o.e.t.TransportService ] [elastic_2] Received response for a request that has timed out, sent [48367ms] ago, timed out [18366ms] ago, action [internal:discovery/zen/fd/master_ping], node [{elastic_3}{...}{172.24.0.4}{172.24.0.4:9300}], id [1035]
elastic_3:
[2018-07-04T14:47:04,494][INFO ][o.e.d.z.ZenDiscovery ] [elastic_3] master_left [{elastic_1}{...}{172.24.0.3}{172.24.0.3:9300}], reason [shut_down]
...
[2018-07-04T14:47:04,519][WARN ][o.e.c.NodeConnectionsService] [elastic_3] failed to connect to node {elastic_1}{...}{172.24.0.3}{172.24.0.3:9300} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [elastic_1][172.24.0.3:9300] connect_exception
...
[2018-07-04T14:47:07,550][INFO ][o.e.c.s.MasterService ] [elastic_3] zen-disco-elected-as-master ([1] nodes joined)[, ], reason: new_master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300}
[2018-07-04T14:47:35,026][WARN ][r.suppressed ] path: /_cat/nodes, params: {v=}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
...
[2018-07-04T14:47:37,560][WARN ][o.e.d.z.PublishClusterStateAction] [elastic_3] timed out waiting for all nodes to process published state [4] (timeout [30s], pending nodes: [{elastic_2}{...}{172.24.0.2}{172.24.0.2:9300}])
[2018-07-04T14:47:37,561][INFO ][o.e.c.s.ClusterApplierService] [elastic_3] new_master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300}, reason: apply cluster state (from master [master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300} committed version [4] source [zen-disco-elected-as-master ([1] nodes joined)[, ]]])
[2018-07-04T14:47:41,021][WARN ][o.e.c.s.MasterService ] [elastic_3] cluster state update task [zen-disco-elected-as-master ([1] nodes joined)[, ]] took [33.4s] above the warn threshold of 30s
[2018-07-04T14:47:41,022][INFO ][o.e.c.s.MasterService ] [elastic_3] zen-disco-node-failed({elastic_1}{...}{172.24.0.3}{172.24.0.3:9300}), reason(transport disconnected), reason: removed {{elastic_1}{...}{172.24.0.3}{172.24.0.3:9300},}
[2018-07-04T14:47:56,929][INFO ][o.e.c.s.ClusterApplierService] [elastic_3] removed {{elastic_1}{...}{172.24.0.3}{172.24.0.3:9300},}, reason: apply cluster state (from master [master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300} committed version [5] source [zen-disco-node-failed({elastic_1}{...}{172.24.0.3}{172.24.0.3:9300}), reason(transport disconnected)]])
Why does it take so long time for the cluster to stabilize and respond to requests?
It is puzzling, that:
a) new master is elected (elastic_3):
[2018-07-04T14:47:07,550][INFO ] ... [elastic_3] zen-disco-elected-as-master ([1] nodes joined)[, ], reason: new_master {elastic_3}...
b) then, it is detected by elastic_2:
[2018-07-04T14:47:07,565][INFO ] ... [elastic_2] detected_master {elastic_3}...
c) then, master times out on waiting to process published state:
[2018-07-04T14:47:37,560][WARN ] ... [elastic_3] timed out waiting for all nodes to process published state [4] (timeout [30s], pending nodes: [{elastic_2}...])
d) elastic_2 applies cluster state with warn:
[2018-07-04T14:47:53,933][WARN ] ... [elastic_2] cluster state applier task [apply cluster state (from master [master {elastic_3}...])] took [46.3s] above the warn threshold of 30s
What can cause the timeout (c)? All this is run on local machine (no network issues). Am I missing any configuration?
Meanwhile, requesting both elastic_2 and elastic_3 one ends up with MasterNotDiscoveredException. According to the documentation, the cluster is expected to respond (https://www.elastic.co/guide/en/elasticsearch/reference/6.3/modules-discovery-zen.html#no-master-block).
Did anyone experienced this? I would appreciate any advice on this issue.
Regards,
mim