Cluster state often yellow: data node failed to ping master

We have a 6-node cluster running Elasticsearch 5.5.1. A few times a day the cluster turns yellow and the logs show that a random non-master node has disconnected: the data node and the master failed to ping each other, even though all services are running and the network looks healthy.
The node reconnects after roughly 3 seconds and the cluster recovers.
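
To pin down exactly when each flap happens and line it up with the node logs, something like this minimal health poller could run on one of the nodes. This is only a sketch: the URL, the lack of auth, and the expected node count of 6 are assumptions for this cluster.

```
# Minimal health poller: prints a line whenever the cluster is not green or a
# node is missing, so the time of each flap can be matched against the ES logs.
# Assumption: REST API reachable on localhost:9200 without auth, 6-node cluster.
import json
import time
import urllib.request

ES_URL = "http://localhost:9200"   # assumption: local node, default HTTP port
EXPECTED_NODES = 6

def get_json(path):
    with urllib.request.urlopen(ES_URL + path, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))

while True:
    health = get_json("/_cluster/health")
    if health["status"] != "green" or health["number_of_nodes"] < EXPECTED_NODES:
        print(time.strftime("%Y-%m-%dT%H:%M:%S"),
              "status:", health["status"],
              "nodes:", health["number_of_nodes"],
              "unassigned shards:", health["unassigned_shards"])
    time.sleep(5)
```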

This looks very similar to this thread, but the bug found there was in 5.1.1 and I am running 5.5.1.

Details:

Hardware : Google Compute Engine, 8 CPUs, 30 GB RAM
OS : Ubuntu 16.04.1
Elasticsearch : 5.5.1
Number of indices : 19
Number of shards : 95
Data volume : about 3 TB

Logs from master:

[2018-09-06T08:03:16,919][INFO ][o.e.c.s.ClusterService   ] [node-prod-2] removed {{node-prod-4}{YGrRouBbReeKIIM8Isr0Rg}{rrs33D_tTkGUS4v3DkW_LA}{10.150.0.10}{10.150.0.10:9300},}, reason: zen-disco-node-failed({node-prod-4}{YGrRouBbReeKIIM8Isr0Rg}{rrs33D_tTkGUS4v3DkW_LA}{10.150.0.10}{10.150.0.10:9300}), reason(transport disconnected)[{node-prod-4}{YGrRouBbReeKIIM8Isr0Rg}{rrs33D_tTkGUS4v3DkW_LA}{10.150.0.10}{10.150.0.10:9300} transport disconnected]
[2018-09-06T08:03:21,584][INFO ][o.e.c.s.ClusterService   ] [node-prod-2] added {{node-prod-4}{YGrRouBbReeKIIM8Isr0Rg}{rrs33D_tTkGUS4v3DkW_LA}{10.150.0.10}{10.150.0.10:9300},}, reason: zen-disco-node-join[{node-prod-4}{YGrRouBbReeKIIM8Isr0Rg}{rrs33D_tTkGUS4v3DkW_LA}{10.150.0.10}{10.150.0.10:9300}]

Logs from data node:

[2018-09-06T08:03:18,565][INFO ][o.e.d.z.ZenDiscovery     ] [node-prod-4] master_left [{node-prod-2}{D5YxVRj9RMqneG8iFRRLMA}{oVN9YJyXQQSMFnuusosekA}{10.150.0.8}{10.150.0.8:9300}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2018-09-06T08:03:18,565][WARN ][o.e.d.z.ZenDiscovery     ] [node-prod-4] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes: 
   {node-prod-1}{wxXC5Rp7StunF8hcyLQLGw}{I97XQScGSzaQCuUzJD9wTQ}{10.150.0.7}{10.150.0.7:9300}
   {node-prod-2}{D5YxVRj9RMqneG8iFRRLMA}{oVN9YJyXQQSMFnuusosekA}{10.150.0.8}{10.150.0.8:9300}, master
   {node-prod-4}{YGrRouBbReeKIIM8Isr0Rg}{rrs33D_tTkGUS4v3DkW_LA}{10.150.0.10}{10.150.0.10:9300}, local
   {node-prod-5}{qKHU59eLQX2j4OoUUr9w7g}{xOxLZHqLRpiz-FMHZW8VOw}{10.150.0.13}{10.150.0.13:9300}
   {node-prod-6}{L9UqFUGZRNCz6DTGwzaAhg}{dmXPfnUhRSGRwtoEJL5i4Q}{10.150.0.14}{10.150.0.14:9300}
   {node-prod-3}{aW1JYgggRveiumCPF26E3g}{gMf63dXlQ9OfMkjnikIZ0g}{10.150.0.9}{10.150.0.9:9300}
[2018-09-06T08:03:21,598][INFO ][o.e.c.s.ClusterService   ] [node-prod-4] detected_master {node-prod-2}{D5YxVRj9RMqneG8iFRRLMA}{oVN9YJyXQQSMFnuusosekA}{10.150.0.8}{10.150.0.8:9300}, reason: zen-disco-receive(from master [master {node-prod-2}{D5YxVRj9RMqneG8iFRRLMA}{oVN9YJyXQQSMFnuusosekA}{10.150.0.8}{10.150.0.8:9300} committed version [1442]])
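
The "failed to ping, tried [3] times, each with maximum [30s] timeout" part matches the documented zen fault-detection defaults (discovery.zen.fd.ping_retries = 3, discovery.zen.fd.ping_timeout = 30s), so the timeouts do not look like they have been tightened anywhere. A minimal sketch to confirm that no node overrides them, assuming the REST API is reachable on localhost:9200 without auth:

```
# Sketch: list any discovery.zen.fd.* overrides per node. No output means every
# node is on the fault-detection defaults reflected in the log message above
# (ping_interval 1s, ping_timeout 30s, ping_retries 3).
# Assumption: REST API on localhost:9200 without auth.
import json
import urllib.request

ES_URL = "http://localhost:9200"   # assumption: local node, default HTTP port

with urllib.request.urlopen(ES_URL + "/_nodes?flat_settings=true", timeout=10) as resp:
    info = json.loads(resp.read().decode("utf-8"))

for node in info["nodes"].values():
    for key, value in node.get("settings", {}).items():
        if key.startswith("discovery.zen.fd."):
            print(node["name"], key, "=", value)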

What's in the logs before the timeout?

Master logs:

[2018-09-06T08:03:16,833][DEBUG][o.e.a.s.TransportSearchScrollAction] [node-prod-2] [47078634] Failed to execute fetch phase
org.elasticsearch.transport.NodeDisconnectedException: [node-prod-4][10.150.0.10:9300][indices:data/read/search[phase/fetch/id/scroll]] disconnected
[2018-09-06T08:03:16,836][WARN ][o.e.t.TransportService   ] [node-prod-2] Transport response handler not found of id [75250496]
[2018-09-06T08:03:16,919][INFO ][o.e.c.r.a.AllocationService] [node-prod-2] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{node-prod-4}{YGrRouBbReeKIIM8Isr0Rg}{rrs33D_tTkGUS4v3DkW_LA}{10.150.0.10}{10.150.0.10:9300} transport disconnected]).

Data node:

[2018-09-06T08:03:17,433][WARN ][o.e.a.b.TransportShardBulkAction] [node-prod-4] [[<index-name-censored>][4]] failed to perform indices:data/write/bulk[s] on replica [<index-name-censored>][4], node[wxXC5Rp7StunF8hcyLQLGw], [R], s[STARTED], a[id=9790uxRbRl-1-0X6shM1Ug]
org.elasticsearch.transport.RemoteTransportException: [node-prod-1][172.18.0.2:9300][indices:data/write/bulk[s][r]]
Caused by: java.lang.IllegalStateException: active primary shard cannot be a replication target before  relocation hand off [<index-name-censored>][4], node[wxXC5Rp7StunF8hcyLQLGw], [P], s[STARTED], a[id=9790uxRbRl-1-0X6shM1Ug], state is [STARTED]
[2018-09-06T08:03:17,524][DEBUG][o.e.a.s.TransportSearchAction] [node-prod-4] [<index-name-censored>][1], node[D5YxVRj9RMqneG8iFRRLMA], [P], s[STARTED], a[id=-GrxDgpYTbqjWYuVPEgpIg]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[<alias-name-censored>], indicesOptions=IndicesOptions[id=38, ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_alisases_to_multiple_indices=true, forbid_closed_indices=true], types=[], routing='null', preference='null', requestCache=null, scroll=null, source={...}
org.elasticsearch.transport.RemoteTransportException: [node-prod-2][172.19.0.2:9300][indices:data/read/search[phase/query]]
Caused by: ...
[2018-09-06T08:03:17,748][WARN ][o.e.c.a.s.ShardStateAction] [node-prod-4] [<index-name-censored>][4] unexpected failure while sending request [internal:cluster/shard/failure] to [{node-prod-2}{D5YxVRj9RMqneG8iFRRLMA}{oVN9YJyXQQSMFnuusosekA}{10.150.0.8}{10.150.0.8:9300}] for shard entry [shard id [[<index-name-censored>][4]], allocation id [9790uxRbRl-1-0X6shM1Ug], primary term [26], message [failed to perform indices:data/write/bulk[s] on replica [<index-name-censored>][4], node[wxXC5Rp7StunF8hcyLQLGw], [R], s[STARTED], a[id=9790uxRbRl-1-0X6shM1Ug]], failure [RemoteTransportException[[node-prod-1][172.18.0.2:9300][indices:data/write/bulk[s][r]]]; nested: IllegalStateException[active primary shard cannot be a replication target before  relocation hand off [<index-name-censored>][4], node[wxXC5Rp7StunF8hcyLQLGw], [P], s[STARTED], a[id=9790uxRbRl-1-0X6shM1Ug], state is [STARTED]]; ]]
org.elasticsearch.transport.RemoteTransportException: [node-prod-2][172.19.0.2:9300][internal:cluster/shard/failure]
Caused by: org.elasticsearch.cluster.action.shard.ShardStateAction$NoLongerPrimaryShardException: primary term [26] did not match current primary term [27]
[2018-09-06T08:03:17,773][WARN ][o.e.i.e.Engine           ] [node-prod-4] [<index-name-censored>][4] failed engine [primary shard [[<index-name-censored>][4], node[YGrRouBbReeKIIM8Isr0Rg], [P], s[STARTED], a[id=lPR4y0sjQTepzlKE246nOA]] was demoted while failing replica shard]
org.elasticsearch.cluster.action.shard.ShardStateAction$NoLongerPrimaryShardException: primary term [26] did not match current primary term [27]
[2018-09-06T08:03:17,816][WARN ][o.e.i.c.IndicesClusterStateService] [node-prod-4] [[<index-name-censored>][4]] marking and sending shard failed due to [shard failure, reason [primary shard [[<index-name-censored>][4], node[YGrRouBbReeKIIM8Isr0Rg], [P], s[STARTED], a[id=lPR4y0sjQTepzlKE246nOA]] was demoted while failing replica shard]]
org.elasticsearch.cluster.action.shard.ShardStateAction$NoLongerPrimaryShardException: primary term [26] did not match current primary term [27]
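
The master-side logs show the transport connection to node-prod-4 dropping in the middle of in-flight requests, so one thing worth watching is whether the number of open transport connections between nodes dips at the moment of a flap. A rough sketch using the nodes stats API; the host, port, and lack of auth are assumptions:

```
# Sketch: poll the nodes stats API and print the number of open transport
# connections per node, to see whether connections drop when the cluster flaps.
# Assumption: REST API on localhost:9200 without auth.
import json
import time
import urllib.request

ES_URL = "http://localhost:9200"   # assumption: local node, default HTTP port

while True:
    with urllib.request.urlopen(ES_URL + "/_nodes/stats/transport", timeout=10) as resp:
        stats = json.loads(resp.read().decode("utf-8"))
    open_connections = {node["name"]: node["transport"]["server_open"]
                        for node in stats["nodes"].values()}
    print(time.strftime("%Y-%m-%dT%H:%M:%S"), open_connections)
    time.sleep(5)
```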

Since this happens several times a day, I also checked the logs of another data node that dropped out.

In that case I didn't see anything interesting in the logs before the timeout.
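
For the flaps where nothing stands out, something like the sketch below can be pointed at each node's log to pull the WARN/ERROR lines from the minute leading up to a given master_left timestamp; the log path and the timestamp format are assumptions based on a default package install.

```
# Sketch: print the WARN/ERROR lines logged in the minute before a given
# master_left timestamp. The log path and the timestamp slice are assumptions
# based on a default package install and the default log4j2 pattern.
from datetime import datetime, timedelta

LOG_PATH = "/var/log/elasticsearch/elasticsearch.log"   # assumption
EVENT = datetime.strptime("2018-09-06T08:03:18,565", "%Y-%m-%dT%H:%M:%S,%f")
WINDOW = timedelta(minutes=1)

with open(LOG_PATH, errors="replace") as log:
    for line in log:
        try:
            stamp = datetime.strptime(line[1:24], "%Y-%m-%dT%H:%M:%S,%f")
        except ValueError:
            continue   # stack-trace and continuation lines carry no timestamp
        if EVENT - WINDOW <= stamp <= EVENT and ("[WARN " in line or "[ERROR" in line):
            print(line.rstrip())
```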
