Elasticsearch 6.1.3 -- failed to discover master after node restart

Hello all,

Our cluster had a node (node3) lose connectivity, and now it will not rejoin the cluster. I've used tcpdump to study the traffic, and it appears the master node responds to the heartbeat messages it receives from node3 up to three times before giving up. So it would seem there is something wrong with node3. I've tried restarting Elasticsearch more than once, to no avail.

There are 8 nodes: 6 data nodes and 2 ingest-only nodes. 3 of the data nodes are master-eligible. discovery.zen.minimum_master_nodes is set to 2, and Zen discovery uses unicast.
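For reference, the relevant discovery settings in elasticsearch.yml on the master-eligible nodes look roughly like this (a sketch; the exact unicast host list and node1's address are assumptions):

node.master: true
node.data: true
# require 2 of the 3 master-eligible nodes for a quorum
discovery.zen.minimum_master_nodes: 2
# unicast host list (addresses assumed for illustration)
discovery.zen.ping.unicast.hosts: ["10.1.1.1", "10.1.1.2", "10.1.1.3"]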

Master nodes are node1, node2, and node3. Node3 had a network issue and was restarted. By all accounts node3 seems fine: it is not dropping traffic, extended pings work fine, etc. Node3 attempts to contact the master nodes, but for some reason it does not accept their responses and declares that it cannot discover the master.
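One more sanity check I can run from node3 is confirming that the masters' transport port (9300) is reachable, e.g. (a sketch; addresses taken from the logs, node1's assumed):

# verify the transport port is reachable from node3
nc -vz 10.1.1.1 9300
nc -vz 10.1.1.2 9300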

There are no other logs to help diagnose the issue, and all other nodes are operating fine.

What do you suggest?

Thanks for your help!

To illustrate the odd nature of this situation, I can successfully issue a /_cluster/health REST request from node3 to both master nodes.
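Concretely, something like this returns a normal health response from either master when run on node3 (a sketch; hosts and HTTP port assumed):

curl -s 'http://10.1.1.1:9200/_cluster/health?pretty'
curl -s 'http://10.1.1.2:9200/_cluster/health?pretty'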

This is the specific error on node3:

[2018-03-20T10:41:14,023][INFO ][o.e.d.z.ZenDiscovery ] [node3] failed to send join request to master [{node2}{}{}{10.1.1.2}{10.1.1.2:9300}{rack=rackA6-1}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]

Is there some sort of debugging option I can enable to diagnose this further? I could decommission the node so the cluster heals back to green, but I'd like to fix the issue without doing that, as it takes a long time to exclude a node and then include it again.
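If it helps, I could raise the discovery logging level dynamically via the cluster settings API, along these lines (a sketch; I'm not sure which logger gives the most useful output):

# bump discovery logging to TRACE on the fly (revert later by setting it to null)
curl -s -XPUT 'http://10.1.1.2:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "logger.org.elasticsearch.discovery": "TRACE"
  }
}'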

Another likely related phenomenon: the cluster will not respond to the decommission command (the cluster.routing.allocation.exclude._ip setting) to remove the failed node and rebalance. It reports that the setting was applied, but nothing changes. After a while I reverted the exclusion.
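This is roughly the exclusion I applied and later reverted (a sketch; I used the transient setting here, and node3's IP is taken from the logs):

# exclude node3 from shard allocation
curl -s -XPUT 'http://10.1.1.2:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": { "cluster.routing.allocation.exclude._ip": "10.1.1.3" }
}'

# revert the exclusion
curl -s -XPUT 'http://10.1.1.2:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": { "cluster.routing.allocation.exclude._ip": null }
}'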

Our cluster has been stuck in a yellow state with 8 unassigned shards and 832 initializing shards for the past 10 hours.
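I can also pull the allocation explanation for the first unassigned shard if that's useful, e.g. (a sketch):

# explain why the first unassigned shard is not being allocated
curl -s 'http://10.1.1.2:9200/_cluster/allocation/explain?pretty'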

When we experienced this problem before, it was because one of the nodes had an OOM issue; this time it was due to an unknown network cable or port error. Back then a full cluster restart was the only way to recover. We are hoping that isn't the case again, but at this point we'd already be halfway through a full recovery.

If this is a bug specific to our environment, then I'd like to help diagnose and resolve it to prevent future occurrences. I'll try to keep the cluster in its current state for as long as possible; if anyone has suggestions, we're all ears.

Here are some logs from earlier this morning, but they don't seem useful:

From node1:

[2018-03-20T04:37:40,433][INFO ][o.e.c.s.ClusterApplierService] [node1] removed {{node3}{redacted}{redacted}{10.1.1.3}{10.1.1.3:9300}{rack=rackA6-1},}, reason: apply cluster state (from master [master {node2}{redacted}{redacted}{10.1.1.2}{10.1.1.2:9300}{rack=rackA6-1} committed version [2938]])

From node2 (master):

[2018-03-20T04:34:48,816][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [node2] collector [index_recovery] timed out when collecting data

[2018-03-20T04:34:50,636][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [node2] failed to execute on node [node3]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [node3][10.1.1.3:9300][cluster:monitor/nodes/stats[n]] request_id [4665706100] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:940) (..)
[2018-03-20T04:34:58,829][ERROR][o.e.x.m.c.i.IndexStatsCollector] [node2] collector [index-stats] timed out when collecting data

[2018-03-20T04:35:10,257][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [node2] collector [cluster_stats] timed out when collecting data

[2018-03-20T04:35:25,435][INFO ][o.e.m.j.JvmGcMonitorService] [node2] [gc][4093856] overhead, spent [451ms] collecting in the last [1s]

[2018-03-20T04:35:29,125][ERROR][o.e.x.m.c.i.IndexStatsCollector] [node2] collector [index-stats] timed out when collecting data

[2018-03-20T04:35:30,353][WARN ][o.e.t.TransportService ] [node2] Received response for a request that has timed out, sent [59393ms] ago, timed out [29393ms] ago, action [internal:discovery/zen/fd/ping], node [{node3}{redacted}{redacted}{10.1.1.3}{10.1.1.3:9300}{rack=rackA6-1}], id [4665706066]

[2018-03-20T04:35:30,354][WARN ][o.e.t.TransportService ] [node2] Received response for a request that has timed out, sent [54718ms] ago, timed out [39718ms] ago, action [cluster:monitor/nodes/stats[n]], node [{node3}{redacted}{redacted}{10.1.1.3}{10.1.1.3:9300}{rack=rackA6-1}], id [4665706100]

[2018-03-20T04:36:54,383][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [node2] failed to execute on node [redacted]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [node3][10.1.1.3:9300][cluster:monitor/nodes/stats[n]] request_id [4665744753] timed out after [15000ms]

[2018-03-20T04:37:40,301][INFO ][o.e.c.r.a.AllocationService] [node2] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{node3}{redacted}{redacted}{10.1.1.3}{10.1.1.3:9300}{rack=rackA6-1} failed to ping, tried [3] times, each with maximum [30s] timeout]).

[2018-03-20T04:37:40,301][INFO ][o.e.c.s.MasterService ] [node2] zen-disco-node-failed({node3}{redacted}{redacted}{10.1.1.3}{10.1.1.3:9300}{rack=rackA6-1}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{node3}{redacted}{redacted}{10.1.1.3}{10.1.1.3:9300}{rack=rackA6-1} failed to ping, tried [3] times, each with maximum [30s] timeout], reason: removed {{node3}{redacted}{redacted}{10.1.1.3}{10.1.1.3:9300}{rack=rackA6-1},}

This problem happened again.

This time, we migrated the network cable to a new port.

However, after restoring network connectivity the cluster will not heal; it's the same "master node not discovered" issue. A full cluster restart is the only way to "fix" it.
