Cluster state often yellow: data node and master are running but fail to ping each other

Hello Guys!

Could anyone please help with a strange problem I have been working on for a few days?

Question in short:
I have a 10-node Elasticsearch cluster running ES 5.1.1.
Recently, almost every day, the cluster goes yellow and the logs show that a data node and the master failed to ping each other, even though all services are running and the network looks fine according to our monitoring system.
The master node never changes, but the node that fails to ping the master is random: a different node each time.

Details are listed at the end.

Questions:

  1. What are the potential reasons for this, and how can I find the root cause? (See the sketch after this list.)
  2. How can I fix it? :slight_smile:
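
For reference, here is a minimal Python sketch of how the flip to yellow could be watched from the HTTP API, to capture exactly when it happens and which node dropped out. It assumes the default HTTP port 9200 (the 12300 in the logs below is the transport port) and a placeholder host; adjust both for your setup.

    # Minimal sketch: poll cluster health and the node list so the exact moment
    # a node drops out is captured. Host and port below are assumptions.
    import json
    import time
    import urllib.request

    ES = "http://192.168.2.37:9200"  # any reachable node; placeholder address

    def get(path):
        with urllib.request.urlopen(ES + path, timeout=10) as resp:
            return json.loads(resp.read().decode("utf-8"))

    known = None
    while True:
        health = get("/_cluster/health")
        nodes = {n["name"] for n in get("/_cat/nodes?h=name&format=json")}
        if known is not None and nodes != known:
            print(time.strftime("%Y-%m-%d %H:%M:%S"),
                  "left:", known - nodes, "joined:", nodes - known)
        if health["status"] != "green":
            print(time.strftime("%Y-%m-%d %H:%M:%S"),
                  "status:", health["status"],
                  "unassigned shards:", health["unassigned_shards"])
        known = nodes
        time.sleep(30)

The timestamps it prints can then be matched against the network monitoring to see whether anything else happened at the same moment.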

Details:

Hardware: 10 Azure VMs, each with 16 cores, 56 GB RAM, and an 11 TB page blob disk
OS: CentOS 7.2
Elasticsearch: v5.1.1
Number of indices: 238
Number of shards: 4,276 (including 1 replica)
Data volume: about 62 TB (including 1 replica)
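
Just for context on how the data is spread, here is a hedged sketch (same assumptions as above: placeholder host, default HTTP port 9200) that prints the per-node shard count and disk usage; with the numbers above it should average out to roughly 430 shards and about 6 TB per node.

    # Sketch: per-node shard count and disk usage, to see how the 4,276 shards
    # and ~62 TB are actually distributed across the 10 nodes.
    import json
    import urllib.request

    ES = "http://192.168.2.37:9200"  # placeholder host, default HTTP port assumed
    with urllib.request.urlopen(ES + "/_cat/allocation?format=json&bytes=gb", timeout=10) as resp:
        rows = json.loads(resp.read().decode("utf-8"))
    for row in rows:
        print(row["node"], row["shards"], "shards,", row["disk.indices"], "GB of index data")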

Logs from the master node when the cluster changed to yellow:

    [2018-04-09T07:54:39,916][INFO ][o.e.c.r.a.AllocationService] [node-3] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{node-2}{SlJIt0cTSYCP5lOELmsYyQ}{pP7NdDfIR6Gzg-tATkuLhg}{192.168.2.36}{192.168.2.36:12300} transport disconnected]).
    [2018-04-09T07:54:39,917][INFO ][o.e.c.s.ClusterService   ] [node-3] removed {{node-2}{SlJIt0cTSYCP5lOELmsYyQ}{pP7NdDfIR6Gzg-tATkuLhg}{192.168.2.36}{192.168.2.36:12300},}, reason: zen-disco-node-failed({node-2}{SlJIt0cTSYCP5lOELmsYyQ}{pP7NdDfIR6Gzg-tATkuLhg}{192.168.2.36}{192.168.2.36:12300}), reason(transport disconnected)[{node-2}{SlJIt0cTSYCP5lOELmsYyQ}{pP7NdDfIR6Gzg-tATkuLhg}{192.168.2.36}{192.168.2.36:12300} transport disconnected]
    [2018-04-09T07:54:39,960][DEBUG][o.e.a.a.c.n.i.TransportNodesInfoAction] [node-3] failed to execute on node [SlJIt0cTSYCP5lOELmsYyQ]
    org.elasticsearch.transport.SendRequestTransportException: [node-2][192.168.2.36:12300][cluster:monitor/nodes/info[n]]
            at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:531) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:465) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.start(TransportNodesAction.java:205) ~[elasticsearch-5.1.1.jar:5.1.1]
           ......

Logs from the node that "left" when the cluster changed to yellow:

    [2018-04-09T07:55:16,749][WARN ][o.e.i.c.IndicesClusterStateService] [node-2] [[fooindexname][9]] marking and sending shard failed due to [shard failure, reason [primary shard [[fooindexname][9], node[SlJIt0cTSYCP5lOELmsYyQ], [P], s[STARTED], a[id=vuSJtoLNRyyGHOnuXSy7Mw]] was demoted while failing replica shard]]
    org.elasticsearch.cluster.action.shard.ShardStateAction$NoLongerPrimaryShardException: primary term [3] did not match current primary term [4]
            at org.elasticsearch.cluster.action.shard.ShardStateAction$ShardFailedClusterStateTaskExecutor.execute(ShardStateAction.java:280) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.cluster.service.ClusterService.runTasksForExecutor(ClusterService.java:581) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:920) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:458) [elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) ~[elasticsearch-5.1.1.jar:5.1.1]
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_91]
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_91]
            at java.lang.Thread.run(Thread.java:745) [?:1.8.0_91]
    [2018-04-09T07:55:19,272][INFO ][o.e.d.z.ZenDiscovery     ] [node-2] master_left [{node-3}{05Q9p3FqTPCuwAPH3cby7w}{2Sioy-2vSnq2PfG5mVyWjA}{192.168.2.37}{192.168.2.37:12300}], reason [failed to ping, tried [5] times, each with  maximum [2m] timeout]
    [2018-04-09T07:55:19,273][WARN ][o.e.d.z.ZenDiscovery     ] [node-2] master left (reason = failed to ping, tried [5] times, each with  maximum [2m] timeout), current nodes: nodes:
      ...
       {node-2}{SlJIt0cTSYCP5lOELmsYyQ}{pP7NdDfIR6Gzg-tATkuLhg}{192.168.2.36}{192.168.2.36:12300}, local

    [2018-04-09T07:55:19,273][INFO ][o.e.c.s.ClusterService   ] [node-2] removed {{node-3}{05Q9p3FqTPCuwAPH3cby7w}{2Sioy-2vSnq2PfG5mVyWjA}{192.168.2.37}{192.168.2.37:12300},}, reason: master_failed ({node-3}{05Q9p3FqTPCuwAPH3cby7w}{2Sioy-2vSnq2PfG5mVyWjA}{192.168.2.37}{192.168.2.37:12300})
    [2018-04-09T07:55:21,483][WARN ][r.suppressed             ] path: /_bulk, params: {}
    org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
            at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:161) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:147) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(TransportBulkAction.java:234) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:174) ~[elasticsearch-5.1.1.jar:5.1.1]
            at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:74) ~[elasticsearch-5.1.1.jar:5.1.1]
           ......

And the zen discovery configuration:

    discovery.zen.fd.ping_interval: 30s
    discovery.zen.fd.ping_retries: 5
    discovery.zen.fd.ping_timeout: 120s
    discovery.zen.ping_timeout: 120s
    discovery.zen.commit_timeout: 60s
    discovery.zen.publish_timeout: 60s
    client.transport.ping_timeout: 60s
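
With these values the master is only declared gone after 5 failed pings of up to 2 minutes each, i.e. roughly 10 minutes, which matches the "failed to ping, tried [5] times, each with maximum [2m] timeout" message in the log above. As far as I know these are per-node settings, so here is a small sketch (same placeholder host and port assumptions as earlier) that reads them back from every node via the nodes info API, to confirm no node is still on the defaults:

    # Sketch: read the fault-detection settings back from every node in the
    # cluster; a value of None means the node is still using the default.
    import json
    import urllib.request

    ES = "http://192.168.2.37:9200"  # placeholder host, default HTTP port assumed
    with urllib.request.urlopen(ES + "/_nodes/settings?flat_settings=true", timeout=10) as resp:
        nodes = json.loads(resp.read().decode("utf-8"))["nodes"]
    for info in nodes.values():
        s = info["settings"]
        print(info["name"],
              s.get("discovery.zen.fd.ping_interval"),
              s.get("discovery.zen.fd.ping_retries"),
              s.get("discovery.zen.fd.ping_timeout"))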

Found the reason: it is a bug in ES 5.1.1.
Hope this helps anyone else hitting the same issue get there more easily.
