ES 6.2.3 cluster goes down unexpectedly

renjith_ravindran · May 10, 2018, 11:06am

I have a cluster with 2 coordinating nodes and 4 data/master nodes
(1 coordinating node and 2 data/master nodes in each DC).

I have not started indexing any data, and no index has been created yet. Still, the cluster goes down unexpectedly and I am seeing a lot of connection errors in the logs.

network.host: 0.0.0.0
http.port: 9200
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 30s
bootstrap.memory_lock: true
node.master: false
node.ingest: false
node.data: false
cluster.routing.allocation.awareness.attributes: dc
transport.publish_host: 10.60.1XX.XX
node.name: ivylx3601
node.attr.dc: ttc
discovery.zen.ping.unicast.hosts:

10.60.1XX.XX (coordinating)
10.60.2XX.XX
10.60.3XX.XX
10.61.1XX.XX (coordinating)
10.61.2XX.XX
10.61.3XX.XX

[2018-05-10T10:32:59,725][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [ivylx3601] timed out while retrying [cluster:monitor/state] after failure (timeout [30s])
[2018-05-10T10:32:59,725][WARN ][r.suppressed ] path: /_cluster/state/blocks, params: {metric=blocks}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:213) [elasticsearch-6.2.3.jar:6.2.3]
at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:317) [elasticsearch-6.2.3.jar:6.2.3]
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:244) [elasticsearch-6.2.3.jar:6.2.3]
at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:581) [elasticsearch-6.2.3.jar:6.2.3]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573) [elasticsearch-6.2.3.jar:6.2.3]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
[2018-05-10T10:32:59,757][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [ivylx3601] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
[2018-05-10T10:32:59,757][WARN ][r.suppressed ] path: /_cluster/health, params: {}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:213) [elasticsearch-6.2.3.jar:6.2.3]
at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:317) [elasticsearch-6.2.3.jar:6.2.3]
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:244) [elasticsearch-6.2.3.jar:6.2.3]
at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:581) [elasticsearch-6.2.3.jar:6.2.3]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573) [elasticsearch-6.2.3.jar:6.2.3]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]

Christian_Dahlqvist · May 10, 2018, 11:12am

If you have 4 master/data nodes, discovery.zen.minimum_master_nodes should be set to 3 (not 2) as per these guidelines in order to avoid split-brain scenarios.
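
For reference, the usual quorum rule for this setting is (number of master-eligible nodes / 2) + 1, rounded down. A minimal sketch of that arithmetic follows; the function name `minimum_master_nodes` is only illustrative and is not part of any Elasticsearch API.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Return the quorum of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


if __name__ == "__main__":
    # Print the quorum for a few cluster sizes, including the 4 nodes
    # described in this thread.
    for n in (3, 4, 5):
        print(f"{n} master-eligible nodes -> quorum {minimum_master_nodes(n)}")
```

For the 4 master-eligible nodes in this thread the result is 3, which is where the "3 (not 2)" comes from.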

How far apart are the two DCs? Elasticsearch requires good bandwidth and low latency between nodes, so deployments across multiple DCs that are not very close and well connected are not recommended.

renjith_ravindran · May 10, 2018, 11:21am

We have put it as 3 purposefully in order to avoid the situation where one DC goes down. The DC connectivity is stable and has high bandwidth as well.
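
The trade-off being discussed here can be checked with a small sketch. It assumes the layout from the first post (2 master-eligible nodes in each of 2 DCs) and a simplified model in which a group of nodes can elect a master whenever it sees at least `minimum_master_nodes` master-eligible nodes. The helper names `can_elect` and `evaluate` are hypothetical, and the model ignores timeouts and other details of Zen discovery.

```python
from typing import Dict, Sequence


def can_elect(visible_masters: int, min_master_nodes: int) -> bool:
    """A group can elect a master if it sees at least the configured minimum."""
    return visible_masters >= min_master_nodes


def evaluate(dc_sizes: Sequence[int], min_master_nodes: int) -> Dict[str, bool]:
    """Check two failure scenarios for a given minimum_master_nodes value.

    dc_sizes holds the number of master-eligible nodes in each DC.
    """
    total = sum(dc_sizes)
    # Scenario 1: the link between DCs is cut, so each DC sees only its own
    # nodes. Split-brain is possible if more than one side can elect a master.
    sides_electing = sum(can_elect(n, min_master_nodes) for n in dc_sizes)
    # Scenario 2: one whole DC goes down. The cluster survives only if the
    # remaining nodes can still elect a master, whichever DC is lost.
    survives = all(can_elect(total - n, min_master_nodes) for n in dc_sizes)
    return {
        "split_brain_possible": sides_electing > 1,
        "survives_any_dc_loss": survives,
    }


if __name__ == "__main__":
    for value in (2, 3):
        print(value, evaluate((2, 2), value))
```

Under these assumptions, a value of 2 survives the loss of one DC but allows split-brain when the DCs are partitioned, while a value of 3 prevents split-brain but leaves no electable master if either DC is lost.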

system · June 7, 2018, 11:21am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
