ES 6.2.3 cluster goes down unexpectedly

I have a cluster with 2 coordinating nodes and 4 data/master nodes
(1 coordinating node and 2 data/master nodes in each DC).

I have not started indexing any data, and no index has been created yet. Still, the cluster goes down unexpectedly and I am seeing a lot of connection errors in the logs.

network.host: 0.0.0.0
http.port: 9200
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 30s
bootstrap.memory_lock: true
node.master: false
node.ingest: false
node.data: false
cluster.routing.allocation.awareness.attributes: dc
transport.publish_host: 10.60.1XX.XX
node.name: ivylx3601
node.attr.dc: ttc
discovery.zen.ping.unicast.hosts:
  - 10.60.1XX.XX  # coordinating node
  - 10.60.2XX.XX
  - 10.60.3XX.XX
  - 10.61.1XX.XX  # coordinating node
  - 10.61.2XX.XX
  - 10.61.3XX.XX

    [2018-05-10T10:32:59,725][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [ivylx3601] timed out while retrying [cluster:monitor/state] after failure (timeout [30s])
    [2018-05-10T10:32:59,725][WARN ][r.suppressed ] path: /_cluster/state/blocks, params: {metric=blocks}
    org.elasticsearch.discovery.MasterNotDiscoveredException: null
    at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:213) [elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:317) [elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:244) [elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:581) [elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573) [elasticsearch-6.2.3.jar:6.2.3]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
    [2018-05-10T10:32:59,757][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [ivylx3601] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
    [2018-05-10T10:32:59,757][WARN ][r.suppressed ] path: /_cluster/health, params: {}
    org.elasticsearch.discovery.MasterNotDiscoveredException: null
    at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:213) [elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:317) [elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:244) [elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:581) [elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573) [elasticsearch-6.2.3.jar:6.2.3]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]

If you have 4 master/data nodes, discovery.zen.minimum_master_nodes should be set to 3 (not 2), as per these guidelines, in order to avoid split-brain scenarios.
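
For reference, the usual quorum rule for this setting is (number of master-eligible nodes / 2) + 1, rounded down. A minimal sketch of that arithmetic follows; the function name is made up for illustration and is not part of any Elasticsearch API.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum of master-eligible nodes: floor(n / 2) + 1.

    Hypothetical helper for illustration only.
    """
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


# The cluster in this thread has 4 master-eligible (data/master) nodes.
print(minimum_master_nodes(4))  # 3
```

The coordinating nodes have node.master: false, so they do not count towards the total.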

How far apart are the two DCs? Elasticsearch requires good bandwidth and low latency between nodes, so deployments across multiple DCs that are not very close and well connected are not recommended.

We have set it to 2 on purpose, so that the cluster can survive one DC going down. The connectivity between the DCs is stable and has high bandwidth as well.
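
The trade-off being discussed can be sketched roughly as below, assuming the layout described in this thread (2 master-eligible nodes in each of 2 DCs). The function and variable names are hypothetical, and the model is deliberately simplified: it only counts how many master-eligible nodes each side can still see.

```python
from typing import List, Tuple


def analyze(dc_sizes: List[int], min_master_nodes: int) -> Tuple[bool, bool]:
    """Return (survives_any_single_dc_loss, split_brain_possible).

    dc_sizes: master-eligible node count per DC (assumed layout).
    min_master_nodes: value of discovery.zen.minimum_master_nodes.
    """
    total = sum(dc_sizes)
    # After losing any one DC, the remaining nodes must still reach the minimum.
    survives_loss = all(total - size >= min_master_nodes for size in dc_sizes)
    # If the link between DCs breaks, every DC that alone reaches the minimum
    # can elect its own master; two or more such DCs means split brain.
    sides_with_quorum = sum(1 for size in dc_sizes if size >= min_master_nodes)
    return survives_loss, sides_with_quorum >= 2


for setting in (2, 3):
    survives, split = analyze([2, 2], setting)
    print(f"minimum_master_nodes={setting}: "
          f"survives DC loss={survives}, split brain possible={split}")
```

Under these assumptions, a value of 2 keeps a master when one DC is lost but allows split brain if the link between the DCs breaks, while a value of 3 rules out split brain but leaves no master once either DC is gone.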
