Cluster state often yellow: data node and master are running but fail to ping each other

Hello Guys!

Could anyone please help with a strange problem I have been working on for a few days?

Question in short:
I have a 10-node Elasticsearch cluster running ES 5.1.1.
Recently, almost every day, the cluster goes yellow and the logs show that a data node and the master failed to ping each other, even though all services are running and the network looks fine according to our monitoring system.
The master node never changes, but the node that fails to ping the master is random: a different node each time.

Details are listed at the end.

Questions:

  1. What are the potential reasons for this, and how can I find the root cause? (See the sketch after this list.)
  2. How can I fix it? :slight_smile:
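
For reference, here is a minimal Python sketch of how the flip to yellow could be watched from the HTTP API, to capture exactly when it happens and which node dropped out. It assumes the default HTTP port 9200 (the 12300 in the logs below is the transport port) and a placeholder host; adjust both for your setup.

    # Minimal sketch: poll cluster health and the node list so the exact moment
    # a node drops out is captured. Host and port below are assumptions.
    import json
    import time
    import urllib.request

    ES = "http://192.168.2.37:9200"  # any reachable node; placeholder address

    def get(path):
        with urllib.request.urlopen(ES + path, timeout=10) as resp:
            return json.loads(resp.read().decode("utf-8"))

    known = None
    while True:
        health = get("/_cluster/health")
        nodes = {n["name"] for n in get("/_cat/nodes?h=name&format=json")}
        if known is not None and nodes != known:
            print(time.strftime("%Y-%m-%d %H:%M:%S"),
                  "left:", known - nodes, "joined:", nodes - known)
        if health["status"] != "green":
            print(time.strftime("%Y-%m-%d %H:%M:%S"),
                  "status:", health["status"],
                  "unassigned shards:", health["unassigned_shards"])
        known = nodes
        time.sleep(30)

The timestamps it prints can then be matched against the network monitoring to see whether anything else happened at the same moment.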

Details:

Hardware: 10 Azure VMs, each with 16 cores, 56 GB RAM, and an 11 TB page blob disk
OS: CentOS 7.2
Elasticsearch: v5.1.1
Number of indices: 238
Number of shards: 4,276 (including 1 replica)
Data volume: about 62 TB (including 1 replica)
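
Just for context on how the data is spread, here is a hedged sketch (same assumptions as above: placeholder host, default HTTP port 9200) that prints the per-node shard count and disk usage; with the numbers above it should average out to roughly 430 shards and about 6 TB per node.

    # Sketch: per-node shard count and disk usage, to see how the 4,276 shards
    # and ~62 TB are actually distributed across the 10 nodes.
    import json
    import urllib.request

    ES = "http://192.168.2.37:9200"  # placeholder host, default HTTP port assumed
    with urllib.request.urlopen(ES + "/_cat/allocation?format=json&bytes=gb", timeout=10) as resp:
        rows = json.loads(resp.read().decode("utf-8"))
    for row in rows:
        print(row["node"], row["shards"], "shards,", row["disk.indices"], "GB of index data")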

Logs from the master node when the cluster changed to yellow:

    [2018-04-09T07:54:39,916][INFO ][o.e.c.r.a.AllocationService] [node-3] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{node-2}{SlJIt0cTSYCP5lOELmsYyQ}{pP7NdDfIR6Gzg-tATkuLhg}{192.168.2.36}{192.168.2.36:12300} transport disconnected]).
    [2018-04-09T07:54:39,917][INFO ][o.e.c.s.ClusterService   ] [node-3] removed {{node-2}{SlJIt0cTSYCP5lOELmsYyQ}{pP7NdDfIR6Gzg-tATkuLhg}{192.168.2.36}{192.168.2.36:12300},}, reason: zen-disco-node-failed({node-2}{SlJIt0cTSYCP5lOELmsYyQ}{pP7NdDfIR6Gzg-tATkuLhg}{192.168.2.36}{192.168.2.36:12300}), reason(transport disconnected)[{node-2}{SlJIt0cTSYCP5lOELmsYyQ}{pP7NdDfIR6Gzg-tATkuLhg}{192.168.2.36}{192.168.2.36:12300} transport disconnected]
    [2018-04-09T07:54:39,960][DEBUG][o.e.a.a.c.n.i.TransportNodesInfoAction] [node-3] failed to execute on node [SlJIt0cTSYCP5lOELmsYyQ]
    org.elasticsearch.transport.SendRequestTransportException: [node-2][192.168.2.36:12300][cluster:monitor/nodes/info[n]]
            at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:531) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:465) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.start(TransportNodesAction.java:205) ~[elasticsearch-5.1.1.jar:5.1.1]
           ......

Logs from the node that "left" when the cluster changed to yellow:

    [2018-04-09T07:55:16,749][WARN ][o.e.i.c.IndicesClusterStateService] [node-2] [[fooindexname][9]] marking and sending shard failed due to [shard failure, reason [primary shard [[fooindexname][9], node[SlJIt0cTSYCP5lOELmsYyQ], [P], s[STARTED], a[id=vuSJtoLNRyyGHOnuXSy7Mw]] was demoted while failing replica shard]]
    org.elasticsearch.cluster.action.shard.ShardStateAction$NoLongerPrimaryShardException: primary term [3] did not match current primary term [4]
            at org.elasticsearch.cluster.action.shard.ShardStateAction$ShardFailedClusterStateTaskExecutor.execute(ShardStateAction.java:280) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.cluster.service.ClusterService.runTasksForExecutor(ClusterService.java:581) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:920) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:458) [elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) ~[elasticsearch-5.1.1.jar:5.1.1]
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_91]
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_91]
            at java.lang.Thread.run(Thread.java:745) [?:1.8.0_91]
    [2018-04-09T07:55:19,272][INFO ][o.e.d.z.ZenDiscovery     ] [node-2] master_left [{node-3}{05Q9p3FqTPCuwAPH3cby7w}{2Sioy-2vSnq2PfG5mVyWjA}{192.168.2.37}{192.168.2.37:12300}], reason [failed to ping, tried [5] times, each with  maximum [2m] timeout]
    [2018-04-09T07:55:19,273][WARN ][o.e.d.z.ZenDiscovery     ] [node-2] master left (reason = failed to ping, tried [5] times, each with  maximum [2m] timeout), current nodes: nodes:
      ...
       {node-2}{SlJIt0cTSYCP5lOELmsYyQ}{pP7NdDfIR6Gzg-tATkuLhg}{192.168.2.36}{192.168.2.36:12300}, local

    [2018-04-09T07:55:19,273][INFO ][o.e.c.s.ClusterService   ] [node-2] removed {{node-3}{05Q9p3FqTPCuwAPH3cby7w}{2Sioy-2vSnq2PfG5mVyWjA}{192.168.2.37}{192.168.2.37:12300},}, reason: master_failed ({node-3}{05Q9p3FqTPCuwAPH3cby7w}{2Sioy-2vSnq2PfG5mVyWjA}{192.168.2.37}{192.168.2.37:12300})
    [2018-04-09T07:55:21,483][WARN ][r.suppressed             ] path: /_bulk, params: {}
    org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
            at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:161) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:147) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(TransportBulkAction.java:234) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:174) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:74) ~[elasticsearch-5.1.1.jar:5.1.1]
           ......

And the zen discovery configuration:

    discovery.zen.fd.ping_interval: 30s
    discovery.zen.fd.ping_retries: 5
    discovery.zen.fd.ping_timeout: 120s
    discovery.zen.ping_timeout: 120s
    discovery.zen.commit_timeout: 60s
    discovery.zen.publish_timeout: 60s
    client.transport.ping_timeout: 60s
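
With these values the master is only declared gone after 5 failed pings of up to 2 minutes each, i.e. roughly 10 minutes, which matches the "failed to ping, tried [5] times, each with maximum [2m] timeout" message in the log above. As far as I know these are per-node settings, so here is a small sketch (same placeholder host and port assumptions as earlier) that reads them back from every node via the nodes info API, to confirm no node is still on the defaults:

    # Sketch: read the fault-detection settings back from every node in the
    # cluster; a value of None means the node is still using the default.
    import json
    import urllib.request

    ES = "http://192.168.2.37:9200"  # placeholder host, default HTTP port assumed
    with urllib.request.urlopen(ES + "/_nodes/settings?flat_settings=true", timeout=10) as resp:
        nodes = json.loads(resp.read().decode("utf-8"))["nodes"]
    for info in nodes.values():
        s = info["settings"]
        print(info["name"],
              s.get("discovery.zen.fd.ping_interval"),
              s.get("discovery.zen.fd.ping_retries"),
              s.get("discovery.zen.fd.ping_timeout"))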

Found the reason: it is a bug in ES 5.1.1.
Hope this helps anyone else hitting the same issue get there more easily.
