Hello everyone!
Could anyone please help with a strange problem I have been working on for a few days?
Question in short:
I have a 10-node Elasticsearch cluster running ES 5.1.1.
Recently, almost every day, the cluster goes into yellow status and the logs show that a data node and the master failed to ping each other, even though all services are running and the network looks fine according to our monitoring system.
The master node never changes, but the node that fails to ping the master is random; it is a different node each time.
Details are listed at the end.
Questions:
- What are the potential reasons for this, and how can I find the root cause?
- How can I fix it? (A sketch of the diagnostics I can gather is right after this list.)
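In case it helps, here is a sketch of the extra diagnostics I can collect around a disconnect and attach here. It assumes the standard 5.x cluster REST endpoints and the Python requests library; the host and port are placeholders for one of my nodes.

```python
# Sketch of diagnostics I can gather around a disconnect.
# The host/port is a placeholder for one of the cluster nodes (HTTP port).
import requests

ES = "http://192.168.2.37:9200"

# Cluster overview: status, node count, unassigned shards.
print(requests.get(f"{ES}/_cluster/health").json())

# Cluster-state tasks queued on the master at the time.
print(requests.get(f"{ES}/_cluster/pending_tasks").json())

# Per-node JVM garbage collection stats; long old-gen pauses could explain
# missed pings even when the network itself looks healthy.
jvm = requests.get(f"{ES}/_nodes/stats/jvm").json()
for node_id, node in jvm["nodes"].items():
    old_gc = node["jvm"]["gc"]["collectors"]["old"]
    print(node["name"],
          "old-gen collections:", old_gc["collection_count"],
          "total ms:", old_gc["collection_time_in_millis"])

# What each node's threads are busy with at that moment.
print(requests.get(f"{ES}/_nodes/hot_threads").text)
```

I can post the output of any of these if that would help.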
Details:
Hardware: 10 Azure VMs, each with 16 cores, 56 GB RAM, and an 11 TB page blob disk
OS: CentOS 7.2
Elasticsearch: v5.1.1
Number of indices: 238
Number of shards: 4276, including 1 replica
Data volume: about 62 TB, including 1 replica (the per-node load these figures imply is sketched right after this list)
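Just so the numbers above are easier to read, here is the simple per-node arithmetic they imply (nothing measured, just dividing the listed totals by the node count):

```python
# Simple per-node arithmetic over the cluster figures listed above.
nodes = 10
total_shards = 4276      # primary + replica shards
total_data_tb = 62       # primary + replica data, in TB

print(f"shards per node: ~{total_shards / nodes:.0f}")      # ~428
print(f"data per node:   ~{total_data_tb / nodes:.1f} TB")  # ~6.2 TB
```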
Logs from the master node when the cluster changed to yellow:
[2018-04-09T07:54:39,916][INFO ][o.e.c.r.a.AllocationService] [node-3] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{node-2}{SlJIt0cTSYCP5lOELmsYyQ}{pP7NdDfIR6Gzg-tATkuLhg}{192.168.2.36}{192.168.2.36:12300} transport disconnected]).
[2018-04-09T07:54:39,917][INFO ][o.e.c.s.ClusterService ] [node-3] removed {{node-2}{SlJIt0cTSYCP5lOELmsYyQ}{pP7NdDfIR6Gzg-tATkuLhg}{192.168.2.36}{192.168.2.36:12300},}, reason: zen-disco-node-failed({node-2}{SlJIt0cTSYCP5lOELmsYyQ}{pP7NdDfIR6Gzg-tATkuLhg}{192.168.2.36}{192.168.2.36:12300}), reason(transport disconnected)[{node-2}{SlJIt0cTSYCP5lOELmsYyQ}{pP7NdDfIR6Gzg-tATkuLhg}{192.168.2.36}{192.168.2.36:12300} transport disconnected]
[2018-04-09T07:54:39,960][DEBUG][o.e.a.a.c.n.i.TransportNodesInfoAction] [node-3] failed to execute on node [SlJIt0cTSYCP5lOELmsYyQ]
org.elasticsearch.transport.SendRequestTransportException: [node-2][192.168.2.36:12300][cluster:monitor/nodes/info[n]]
at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:531) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:465) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.start(TransportNodesAction.java:205) ~[elasticsearch-5.1.1.jar:5.1.1]
......
Logs from the node that "left" when the cluster changed to yellow:
[2018-04-09T07:55:16,749][WARN ][o.e.i.c.IndicesClusterStateService] [node-2] [[fooindexname][9]] marking and sending shard failed due to [shard failure, reason [primary shard [[fooindexname][9], node[SlJIt0cTSYCP5lOELmsYyQ], [P], s[STARTED], a[id=vuSJtoLNRyyGHOnuXSy7Mw]] was demoted while failing replica shard]]
org.elasticsearch.cluster.action.shard.ShardStateAction$NoLongerPrimaryShardException: primary term [3] did not match current primary term [4]
at org.elasticsearch.cluster.action.shard.ShardStateAction$ShardFailedClusterStateTaskExecutor.execute(ShardStateAction.java:280) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.cluster.service.ClusterService.runTasksForExecutor(ClusterService.java:581) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:920) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:458) [elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) ~[elasticsearch-5.1.1.jar:5.1.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_91]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_91]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_91]
[2018-04-09T07:55:19,272][INFO ][o.e.d.z.ZenDiscovery ] [node-2] master_left [{node-3}{05Q9p3FqTPCuwAPH3cby7w}{2Sioy-2vSnq2PfG5mVyWjA}{192.168.2.37}{192.168.2.37:12300}], reason [failed to ping, tried [5] times, each with maximum [2m] timeout]
[2018-04-09T07:55:19,273][WARN ][o.e.d.z.ZenDiscovery ] [node-2] master left (reason = failed to ping, tried [5] times, each with maximum [2m] timeout), current nodes: nodes:
...
{node-2}{SlJIt0cTSYCP5lOELmsYyQ}{pP7NdDfIR6Gzg-tATkuLhg}{192.168.2.36}{192.168.2.36:12300}, local
[2018-04-09T07:55:19,273][INFO ][o.e.c.s.ClusterService ] [node-2] removed {{node-3}{05Q9p3FqTPCuwAPH3cby7w}{2Sioy-2vSnq2PfG5mVyWjA}{192.168.2.37}{192.168.2.37:12300},}, reason: master_failed ({node-3}{05Q9p3FqTPCuwAPH3cby7w}{2Sioy-2vSnq2PfG5mVyWjA}{192.168.2.37}{192.168.2.37:12300})
[2018-04-09T07:55:21,483][WARN ][r.suppressed ] path: /_bulk, params: {}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:161) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:147) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(TransportBulkAction.java:234) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:174) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:74) ~[elasticsearch-5.1.1.jar:5.1.1]
......
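One thing I am not sure about: the "failed to ping, tried [5] times, each with maximum [2m] timeout" line in the node-2 log looks like it reflects the zen fault-detection settings (discovery.zen.fd.ping_retries / ping_timeout), which I believe default to 3 and 30s, so they may have been overridden on this cluster. Below is a rough sketch of how I would dump the explicitly-set values per node to confirm (same assumptions as above: requests library, placeholder host/port):

```python
# Dump the explicitly-set discovery.zen.fd.* settings per node.
# Only values set in elasticsearch.yml (or at startup) are returned;
# defaults do not appear in the response.
import requests

ES = "http://192.168.2.37:9200"  # placeholder host/port

resp = requests.get(f"{ES}/_nodes/settings").json()
for node_id, node in resp["nodes"].items():
    fd = (node.get("settings", {})
              .get("discovery", {})
              .get("zen", {})
              .get("fd", {}))
    print(node["name"], fd or "no explicit fd settings")
```

If those values were indeed raised, would that hide an underlying problem (long GC pauses, slow disk, etc.) rather than cause one? Any pointers on where to look next would be appreciated.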