Node occasionally drops out

We have a situation where a node occasionally drops out of the cluster. The master flags it as having left and starts relocating shards. The node comes back in about 5 minutes, but it takes a while for the cluster to settle back to green.

It acts as if the node was rebooted, but Windows uptime shows this did not happen. Also, I have a web test running every 5 minutes that verifies all nodes are up. Unless the downtime always fits exactly in the 5 minutes between web tests, it seems the node stays up and only its communication with the cluster is impaired. I do see the node come back in and log an error that it can't find the master.

Is it possible that I am running out of memory or something and that is causing ES to restart?
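One way to tell a restart apart from memory pressure is to compare JVM uptime, heap usage, and old-generation GC totals per node. Below is a minimal sketch, assuming the node stats endpoint (`/_nodes/stats/jvm`) is reachable over HTTP; the base URL and the helper names `fetch_jvm_stats` and `summarize_jvm` are placeholders of mine, not anything from our setup.

```python
import json
from urllib.request import urlopen


def fetch_jvm_stats(base_url="http://localhost:9200"):
    """Fetch per-node JVM stats. base_url is an assumption; use any node's HTTP address."""
    with urlopen(base_url + "/_nodes/stats/jvm", timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def summarize_jvm(stats):
    """Reduce a node stats response to the fields relevant to restarts and GC."""
    rows = {}
    for node in stats.get("nodes", {}).values():
        jvm = node["jvm"]
        old_gc = jvm["gc"]["collectors"]["old"]
        rows[node["name"]] = {
            # A small uptime right after an incident would mean the process restarted.
            "uptime_minutes": jvm["uptime_in_millis"] / 60000.0,
            "heap_used_percent": jvm["mem"]["heap_used_percent"],
            "old_gc_count": old_gc["collection_count"],
            "old_gc_seconds": old_gc["collection_time_in_millis"] / 1000.0,
        }
    return rows


# Usage against a live cluster (not run here):
#   for name, row in sorted(summarize_jvm(fetch_jvm_stats()).items()):
#       print(name, row)
```

If the JVM uptime on the affected node keeps growing across an incident, the process did not restart, and the old GC totals become the more interesting numbers.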

Using ES 2.3.2. In this scenario 'Shard' is the master and 'Landslide' is the node that goes down (the other two nodes are Electron and Chief Examiner).

Below are snippets from my logs, cherry-picked for brevity:

@timestamp:November 19th 2016, 04:35:46.233 message:[Electron] failed to execute on node [..]RemoteTransportException[[Shard][10.0.0.5:9300]
@timestamp:November 19th 2016, 04:37:19.179 message:[Electron] failed to execute on node [..] RemoteTransportException[[Chief Examiner][10.0.0.6:9300]

@timestamp:November 19th 2016, 04:38:34.515 message:[Shard] failed to execute on node [..] ReceiveTimeoutTransportException[[Landslide]

@timestamp:November 19th 2016, 04:39:35.447 message:[Shard] Cluster health status changed from [GREEN] to [YELLOW] (reason: [[{Landslide}{..}{10.0.0.4}{10.0.0.4:9300}] failed]).

@timestamp:November 19th 2016, 04:39:35.448 message:[Shard] removed {{Landslide}{..}{10.0.0.4}{10.0.0.4:9300},}, reason: zen-disco-node_failed({Landslide}{..}{10.0.0.4}{10.0.0.4:9300}),
@timestamp:November 19th 2016, 04:39:35.922 message:[Electron] removed {{Landslide}{..}{10.0.0.4}{10.0.0.4:9300},}, reason: zen-disco-receive(from master [{Shard}{..}{10.0.0.5}{10.0.0.5:9300}])
@timestamp:November 19th 2016, 04:39:36.144 message:[Chief Examiner] removed {{Landslide}{..}{10.0.0.4}{10.0.0.4:9300},}, reason: zen-disco-receive(from master [{Shard}{..}{10.0.0.5}{10.0.0.5:9300}])

@timestamp:November 19th 2016, 04:39:36.501 message:[Shard] delaying allocation for [45] unassigned shards, next check in [1m]

@timestamp:November 19th 2016, 04:39:36.638 message:[Shard] [ocv_v10][2] received shard failed for target shard [[ocv_v10][2], node[..],

@timestamp:November 19th 2016, 04:39:36.818 message:[Chief Examiner] failed to execute on node [..] NodeDisconnectedException[[Landslide][10.0.0.4:9300][
@timestamp:November 19th 2016, 04:39:36.939 message:[Electron] failed to execute on node [..] NodeDisconnectedException[[Landslide][10.0.0.4:9300]

@timestamp:November 19th 2016, 04:39:37.140 message:[ocv_v10][[ocv_v10][0]] IllegalIndexShardStateException[CurrentState[STARTED] shard is not a primary]

@timestamp:November 19th 2016, 04:39:38.017 message:[Shard] [ocv_v10][4] received shard failed for target shard [[ocv_v10][4], node[..],

@timestamp:November 19th 2016, 04:40:42.198 message:[Landslide] [gc][old][562433][5] duration [2.5m], collections [1]/[2.6m], total [2.5m]/[6.5m], memory [13gb]->[2.1gb]/[13.5gb], all_pools {[young] [3.6gb]->[28.1mb]/[3.7gb]}{[survivor] [110.7mb]->[0b]/[477.8mb]}{[old] [9.3gb]->[2.1gb]/[9.3gb]}

@timestamp:November 19th 2016, 04:40:42.557 message:[Landslide] master_left [{Shard}{..}{10.0.0.5}{10.0.0.5:9300}], reason [failed to ping, tried

@timestamp:November 19th 2016, 04:40:42.557 message:[Landslide] removed {{Shard}{..}{10.0.0.5}{10.0.0.5:9300},}, reason: zen-disco-master_failed

@timestamp:November 19th 2016, 04:40:42.761 message:/_bulk Params: {} ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];]

@timestamp:November 19th 2016, 04:40:42.776 message:[Landslide] [127874] Failed to execute fetch phase SendRequestTransportException[[Shard][10.0.0.5:9300]

@timestamp:November 19th 2016, 04:40:46.381 message:[Shard] added {{Landslide}{..}{10.0.0.4}{10.0.0.4:9300},}, reason: zen-disco-join(join from node[{Landslide}{..}{10.0.0.4}{10.0.0.4:9300}])

@timestamp:November 19th 2016, 04:40:46.814 message:[Electron] added {{Landslide}{..}{10.0.0.4}{10.0.0.4:9300},}, reason: zen-disco-receive(from master [{Shard}{..}{10.0.0.5}{10.0.0.5:9300}])

@timestamp:November 19th 2016, 04:40:47.023 message:[Chief Examiner] added {{Landslide}{..}{10.0.0.4}{10.0.0.4:9300},}, reason: zen-disco-receive(from master [{Shard}{..}{10.0.0.5}{10.0.0.5:9300}])

@timestamp:November 19th 2016, 04:40:47.942 message:[Landslide] detected_master {Shard}{..}{10.0.0.5}{10.0.0.5:9300}, added {{Shard}{..}{10.0.0.5}{10.0.0.5:9300},},

@timestamp:November 19th 2016, 04:40:57.054 message:[Landslide] [[taxonomy_v9][4]] marking and sending shard failed due to [failed to create shard]

@timestamp:November 19th 2016, 07:06:31.862 message:[Shard] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[ocv_v9][4]] ...]).
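The `[gc][old]` line from Landslide above reports a single collection lasting 2.5m. To pick lines like that out of a larger log, here is a rough sketch that parses the node name, collection duration, and heap before and after. The regular expression is fitted to the one line quoted above and may need adjusting for other log layouts; `parse_gc_line` and `to_seconds` are hypothetical helper names.

```python
import re

# Pattern fitted to the [gc][old] line quoted in the logs above (an assumption
# about the general layout, not a documented format).
GC_LINE = re.compile(
    r"\[(?P<node>[^\]]+)\] \[gc\]\[(?P<gen>\w+)\]\[\d+\]\[\d+\] "
    r"duration \[(?P<duration>[^\]]+)\].*?"
    r"memory \[(?P<before>[^\]]+)\]->\[(?P<after>[^\]]+)\]/\[(?P<max>[^\]]+)\]"
)

UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def to_seconds(text):
    """Convert a duration such as '2.5m' or '800ms' to seconds."""
    match = re.fullmatch(r"([\d.]+)(ms|s|m|h)", text)
    if match is None:
        raise ValueError("unrecognized duration: " + text)
    return float(match.group(1)) * UNIT_SECONDS[match.group(2)]


def parse_gc_line(line):
    """Return GC details from a log line, or None if the line is not a GC entry."""
    match = GC_LINE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["seconds"] = to_seconds(info["duration"])
    return info


# The line from the logs above, shortened after the memory field.
LINE = ("message:[Landslide] [gc][old][562433][5] duration [2.5m], "
        "collections [1]/[2.6m], total [2.5m]/[6.5m], "
        "memory [13gb]->[2.1gb]/[13.5gb]")

gc = parse_gc_line(LINE)
print(gc["node"], gc["gen"], gc["seconds"], gc["before"], gc["after"])
```

Filtering on `seconds` above some threshold makes it easier to line up long collections with the `failed to ping` and `node_failed` timestamps from the other nodes.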

Thanks,
~john
