We have 6 nodes running ES 5.5.1. A few times a day the cluster goes yellow, and the logs show that a random non-master node disconnected.
The data node and the master fail to ping each other even though the services are all running and the network is healthy.
It takes around 3 seconds for the node to reconnect, and then everything is fine again.
This is very similar to this thread,
but the bug found there was in 5.1.1 and I am using 5.5.1.
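To get exact timestamps for the drops (so I can line them up with the log entries below), I run a small watcher against the cluster. This is only a rough sketch: it assumes the REST API is reachable on http://localhost:9200 without authentication and uses the standard _cat/nodes and _cluster/health APIs. It prints a line whenever the node set changes or the health is not green.

```python
import time

import requests

ES = "http://localhost:9200"  # assumption: local node, no authentication


def node_names():
    # Names of the nodes currently part of the cluster, via _cat/nodes
    resp = requests.get(f"{ES}/_cat/nodes", params={"h": "name"}, timeout=10)
    resp.raise_for_status()
    return set(resp.text.split())


previous = node_names()
while True:
    current = node_names()
    health = requests.get(f"{ES}/_cluster/health", timeout=10).json()
    dropped = previous - current
    joined = current - previous
    if dropped or joined or health["status"] != "green":
        print(time.strftime("%Y-%m-%dT%H:%M:%S"),
              "status:", health["status"],
              "dropped:", sorted(dropped),
              "joined:", sorted(joined))
    previous = current
    time.sleep(1)
```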
Details:
Hardware : Google Compute Engine, 8 CPUs, 30 GB RAM
OS : Ubuntu 16.04.1
Elasticsearch : v5.5.1
Number of indices : 19
Number of shards : 95
Data amount : about 3 TB
Logs from master:
[2018-09-06T08:03:16,919][INFO ][o.e.c.s.ClusterService ] [node-prod-2] removed {{node-prod-4}{YGrRouBbReeKIIM8Isr0Rg}{rrs33D_tTkGUS4v3DkW_LA}{10.150.0.10}{10.150.0.10:9300},}, reason: zen-disco-node-failed({node-prod-4}{YGrRouBbReeKIIM8Isr0Rg}{rrs33D_tTkGUS4v3DkW_LA}{10.150.0.10}{10.150.0.10:9300}), reason(transport disconnected)[{node-prod-4}{YGrRouBbReeKIIM8Isr0Rg}{rrs33D_tTkGUS4v3DkW_LA}{10.150.0.10}{10.150.0.10:9300} transport disconnected]
[2018-09-06T08:03:21,584][INFO ][o.e.c.s.ClusterService ] [node-prod-2] added {{node-prod-4}{YGrRouBbReeKIIM8Isr0Rg}{rrs33D_tTkGUS4v3DkW_LA}{10.150.0.10}{10.150.0.10:9300},}, reason: zen-disco-node-join[{node-prod-4}{YGrRouBbReeKIIM8Isr0Rg}{rrs33D_tTkGUS4v3DkW_LA}{10.150.0.10}{10.150.0.10:9300}]
Logs from data node:
[2018-09-06T08:03:18,565][INFO ][o.e.d.z.ZenDiscovery ] [node-prod-4] master_left [{node-prod-2}{D5YxVRj9RMqneG8iFRRLMA}{oVN9YJyXQQSMFnuusosekA}{10.150.0.8}{10.150.0.8:9300}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2018-09-06T08:03:18,565][WARN ][o.e.d.z.ZenDiscovery ] [node-prod-4] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes:
{node-prod-1}{wxXC5Rp7StunF8hcyLQLGw}{I97XQScGSzaQCuUzJD9wTQ}{10.150.0.7}{10.150.0.7:9300}
{node-prod-2}{D5YxVRj9RMqneG8iFRRLMA}{oVN9YJyXQQSMFnuusosekA}{10.150.0.8}{10.150.0.8:9300}, master
{node-prod-4}{YGrRouBbReeKIIM8Isr0Rg}{rrs33D_tTkGUS4v3DkW_LA}{10.150.0.10}{10.150.0.10:9300}, local
{node-prod-5}{qKHU59eLQX2j4OoUUr9w7g}{xOxLZHqLRpiz-FMHZW8VOw}{10.150.0.13}{10.150.0.13:9300}
{node-prod-6}{L9UqFUGZRNCz6DTGwzaAhg}{dmXPfnUhRSGRwtoEJL5i4Q}{10.150.0.14}{10.150.0.14:9300}
{node-prod-3}{aW1JYgggRveiumCPF26E3g}{gMf63dXlQ9OfMkjnikIZ0g}{10.150.0.9}{10.150.0.9:9300}
[2018-09-06T08:03:21,598][INFO ][o.e.c.s.ClusterService ] [node-prod-4] detected_master {node-prod-2}{D5YxVRj9RMqneG8iFRRLMA}{oVN9YJyXQQSMFnuusosekA}{10.150.0.8}{10.150.0.8:9300}, reason: zen-disco-receive(from master [master {node-prod-2}{D5YxVRj9RMqneG8iFRRLMA}{oVN9YJyXQQSMFnuusosekA}{10.150.0.8}{10.150.0.8:9300} committed version [1442]])
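The reason in these entries, "failed to ping, tried [3] times, each with maximum [30s] timeout", matches the zen fault-detection defaults (discovery.zen.fd.ping_retries: 3, discovery.zen.fd.ping_timeout: 30s), so I don't believe we have overridden anything there. To double-check, I dump the per-node settings; again just a sketch assuming an open REST endpoint on http://localhost:9200:

```python
import requests

ES = "http://localhost:9200"  # assumption: local node, no authentication

# The node info API returns the settings each node was started with.
# Anything under discovery.zen.fd would override the fault-detection
# defaults (ping_interval 1s, ping_timeout 30s, ping_retries 3); an
# empty result means the defaults quoted in the log message are in effect.
resp = requests.get(f"{ES}/_nodes/settings", timeout=10)
resp.raise_for_status()
for info in resp.json()["nodes"].values():
    fd = info.get("settings", {}).get("discovery", {}).get("zen", {}).get("fd", {})
    print(info["name"], fd or "defaults (no discovery.zen.fd.* overrides)")
```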