Elasticsearch version: 5.1.1
When the cluster goes red, the logs show the following.
Master node logs:
Caused by: org.elasticsearch.transport.NodeDisconnectedException: [elsdata09][10.10.0.102:9200][cluster:monitor/nodes/stats[n]] disconnected
Data node logs:
2019-10-31T04:57:26,790 _boss][T#50] [W] rg.ela.clu.act.sha.ShardStateAction - [UID=] - [test-index][9] no master known for action [internal:cluster/shard/failure] for shard entry [shard id [[test-index][9]], allocation id [8ty2aNbFQgS_SQ5-PA4KDQ], primary term [116], message [failed to perform indices:data/write/bulk[s] on replica [test-index][9], node[I-c1LgUZQQKG6B2NYb70Wg], [R], s[STARTED], a[id=8ty2aNbFQgS_SQ5-PA4KDQ]], failure [RemoteTransportException[[elsdata07][10.10.0.100:9200][indices:data/write/bulk[s][r]]]; nested: IllegalStateException[active primary shard cannot be a replication target before relocation hand off [test-index][9], node[I-c1LgUZQQKG6B2NYb70Wg], [P], s[STARTED], a[id=8ty2aNbFQgS_SQ5-PA4KDQ], state is [STARTED]]; ]]
2019-10-31T04:57:29,419 nect]][T#59] [W] org.ela.dis.zen.UnicastZenPing - [UID=] - [22] failed send ping to {#zen_unicast_65#}{INV4EcdPSh2kNR2IKKYvVA}{elsmaster02}{10.10.0.51:9200}
java.lang.IllegalStateException: handshake failed with {#zen_unicast_65#}{INV4EcdPSh2kNR2IKKYvVA}{elsmaster02}{10.10.0.51:9200}
at org.elasticsearch.transport.TransportService.handshake(TransportService.java:370) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.TransportService.connectToNodeLightAndHandshake(TransportService.java:345) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.TransportService.connectToNodeLightAndHandshake(TransportService.java:319) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.discovery.zen.UnicastZenPing$2.run(UnicastZenPing.java:473) [elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:458) [elasticsearch-5.1.1.jar:5.1.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
Caused by: org.elasticsearch.transport.NodeDisconnectedException: [10.10.0.51:9200][internal:transport/handshake] disconnected
2019-10-31T09:12:20,825 teTask][T#1] [W] org.ela.dis.zen.ZenDiscovery - [UID=] - master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes:
=================================================================================
This index is the largest of all the indices in the cluster:
health status index      uuid                   pri rep docs.count docs.deleted store.size pri.store.size
red    open   test-index YA33n2zsS_GAlNsWvGhMKA  20   1  363067962    242803982   1006.6gb        508.1gb
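To put those numbers in perspective, here is a rough calculation of the average primary shard size and the share of deleted documents. It is only a sketch in Python: the function names are my own, and the inputs are copied from the listing above.

```python
def avg_primary_shard_gb(pri_store_gb: float, primaries: int) -> float:
    """Average size of one primary shard, in GB."""
    return pri_store_gb / primaries


def deleted_fraction(docs_count: int, docs_deleted: int) -> float:
    """Share of documents in the index that are marked as deleted."""
    return docs_deleted / (docs_count + docs_deleted)


# Figures copied from the index listing for test-index.
shard_gb = avg_primary_shard_gb(508.1, 20)
deleted = deleted_fraction(363067962, 242803982)
print(f"avg primary shard: {shard_gb:.1f} GB, deleted docs: {deleted:.1%}")
```

That works out to about 25.4 GB per primary shard, with roughly 40% of the documents marked as deleted.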
The shards are unassigned for the following reasons:
test-index 19 p UNASSIGNED ALLOCATION_FAILED
test-index 19 r UNASSIGNED NODE_LEFT
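A small sketch of how this output could be parsed to pick out the unassigned shards and prepare a request body for the cluster allocation explain API (GET _cluster/allocation/explain). It assumes the lines come from _cat/shards with the columns index, shard, prirep, state, unassigned.reason in that order; the helper names are hypothetical.

```python
from typing import List, NamedTuple


class ShardRow(NamedTuple):
    index: str
    shard: int
    prirep: str  # "p" for primary, "r" for replica
    state: str
    reason: str


def parse_unassigned(cat_output: str) -> List[ShardRow]:
    """Parse lines shaped like 'index shard prirep state unassigned.reason'."""
    rows = []
    for line in cat_output.splitlines():
        parts = line.split()
        # Keep only complete rows whose state column is UNASSIGNED.
        if len(parts) >= 5 and parts[3] == "UNASSIGNED":
            rows.append(ShardRow(parts[0], int(parts[1]), parts[2], parts[3], parts[4]))
    return rows


def explain_body(row: ShardRow) -> dict:
    """Request body for the allocation explain API for one shard."""
    return {"index": row.index, "shard": row.shard, "primary": row.prirep == "p"}


sample = """test-index 19 p UNASSIGNED ALLOCATION_FAILED
test-index 19 r UNASSIGNED NODE_LEFT"""
for row in parse_unassigned(sample):
    print(row.reason, explain_body(row))
```

Sending each body to the allocation explain API should give more detail about why the shard was not allocated than the short reason code does.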
However, all nodes are still in the cluster when I check with _cat/nodes, and when I run _cluster/reroute?retry_failed=true the shards get allocated and the cluster turns green.
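For reference, the manual recovery step can be scripted roughly as follows. This is a sketch using only the Python standard library; the base URL is an assumption (replace it with the HTTP address of any node), and the function names are my own.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: HTTP endpoint of any node


def retry_failed_request(base_url: str = ES_URL) -> urllib.request.Request:
    """Build the POST request for _cluster/reroute?retry_failed=true."""
    return urllib.request.Request(
        f"{base_url}/_cluster/reroute?retry_failed=true", data=b"", method="POST"
    )


def cluster_status(base_url: str = ES_URL) -> str:
    """Return the cluster status reported by _cluster/health."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=10) as resp:
        return json.load(resp)["status"]


def retry_and_report(base_url: str = ES_URL) -> None:
    """Retry failed shard allocations, then print the cluster status."""
    with urllib.request.urlopen(retry_failed_request(base_url), timeout=30) as resp:
        print("reroute acknowledged:", json.load(resp).get("acknowledged"))
    print("cluster status:", cluster_status(base_url))


# Call retry_and_report() against a live cluster to run the retry.
```

This only automates the workaround; it does not address why the allocation fails in the first place.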
The refresh interval for this index is 30s, and the cluster is under heavy indexing load.
discovery.zen.minimum_master_nodes is set to 2, and there are about 8 data nodes in the cluster.
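As a sanity check on that setting, the usual quorum rule for discovery.zen.minimum_master_nodes is (number of master-eligible nodes / 2) + 1, rounded down. A minimal sketch follows; the function name is hypothetical, and the number of master-eligible nodes in my cluster is not stated above, so the inputs are illustrative only.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


# Illustrative counts of master-eligible nodes (not taken from the cluster).
for n in (1, 2, 3, 5):
    print(n, "master-eligible ->", minimum_master_nodes(n))
```

A value of 2 matches a cluster with 2 or 3 master-eligible nodes under this rule.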
Please help, as the cluster frequently goes red.