Node disconnects every hour

Omar_Al_Zabir · December 23, 2015, 1:51pm

I have found another clue. Every hour, the master node gets dropped. Immediately after that, I get the translog "failed to delete temp file" warning. For example:

[2015-12-23 05:03:41,890][INFO ][discovery.zen            ] [ec-dyl09026app04] master_left [{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw}, reason [transport disconnected]
[2015-12-23 05:03:41,892][WARN ][discovery.zen            ] [ec-dyl09026app04] master left (reason = transport disconnected), current nodes: {{ec-
[2015-12-23 05:03:41,892][INFO ][cluster.service          ] [ec-dyl09026app04] removed {{ec-rdl04910app06},}, reason: zen-disco-master_failed ({ec-rdl04910app06}
[2015-12-23 05:03:43,932][INFO ][cluster.service          ] [ec-dyl09026app04] detected_master {ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw}{ added {{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw},}, 
[2015-12-23 05:04:12,441][INFO ][cluster.service          ] [ec-dyl09026app04] removed {{ec-dyl09026app02}{jgY3MDv_TI6I6gbccdPi1Q},}, reason: zen-disco-receive(from master [{ec-rdl04910app06}
[2015-12-23 05:04:17,234][WARN ][index.translog           ] [ec-dyl09026app04] [topbeat-2015.12.23][1] failed to delete temp file /opt/softwares/elasticsearch-2.1.0/data/ec-cluster/nodes/0/indices/topbeat-2015.12.23/1/translog/translog-7465629763806507405.tlog

This is at 5:03. Then at 6:03, same story:

[2015-12-23 06:03:43,619][INFO ][discovery.zen            ] [ec-dyl09026app04] master_left [{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw}], reason [transport disconnected]
[2015-12-23 06:03:43,620][WARN ][discovery.zen            ] [ec-dyl09026app04] master left (reason = transport disconnected), current nodes: {{ec-dyl09026app02}{jgY3MDv_TI6I6gbccdPi1Q}{10.35.76.37}{10.35.76.37:9300},{ec-dyl09026app04}
[2015-12-23 06:03:43,621][INFO ][cluster.service          ] [ec-dyl09026app04] removed {{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw} reason: zen-disco-master_failed ({ec-rdl04910app06}
[2015-12-23 06:03:59,567][INFO ][rest.suppressed          ] /_bulk Params: {}
ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];]

Then at 6:07, same problem:

[2015-12-23 06:07:06,407][INFO ][discovery.zen            ] [ec-dyl09026app04] master_left [{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw}], reason [failed to ping, tried [3] times, each with  maximum [30s] timeout]
[2015-12-23 06:07:06,408][WARN ][discovery.zen            ] [ec-dyl09026app04] master left (reason = failed to ping, tried [3] times, each with  maximum [30s] timeout), current nodes: {{ec-dyl09026app02}{jgY3MDv_TI6I6gbccdPi1Q},{ec-dyl09026app04}{p2QuXlyDR6CNvykHSeD2wA}
[2015-12-23 06:07:06,409][INFO ][cluster.service          ] [ec-dyl09026app04] removed {{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw},}, reason: zen-disco-master_failed ({ec-rdl04910app06}

The master node is in a different data center, so I assume some firewall is dropping the connection after 60 minutes of inactivity or activity. It looks like zen discovery keeps a connection open, and that connection gets dropped. The cluster then thinks the master node has left.
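To check whether the drops really follow an hourly pattern, the master_left timestamps can be pulled out of the node log and the gaps between them measured. This is only a rough sketch: the function name master_left_gaps is made up for illustration, the regex assumes the timestamp format shown in the excerpts above, and the sample lines are abbreviated copies of those excerpts (the "..." stands for the node details).

```python
import re
from datetime import datetime

# Timestamp at the start of a line that reports "master_left".
# Assumes the "[YYYY-MM-DD HH:MM:SS,mmm]" prefix seen in the excerpts above.
MASTER_LEFT = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\].*\bmaster_left\b"
)

def master_left_gaps(lines):
    """Return the master_left timestamps and the gaps between them in minutes."""
    stamps = []
    for line in lines:
        m = MASTER_LEFT.match(line)
        if m:
            stamps.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"))
    gaps = [(b - a).total_seconds() / 60 for a, b in zip(stamps, stamps[1:])]
    return stamps, gaps

# Abbreviated versions of the log lines quoted in this post.
sample = [
    "[2015-12-23 05:03:41,890][INFO ][discovery.zen] [ec-dyl09026app04] master_left [...], reason [transport disconnected]",
    "[2015-12-23 05:03:41,892][WARN ][discovery.zen] [ec-dyl09026app04] master left (reason = transport disconnected)",
    "[2015-12-23 06:03:43,619][INFO ][discovery.zen] [ec-dyl09026app04] master_left [...], reason [transport disconnected]",
    "[2015-12-23 06:07:06,407][INFO ][discovery.zen] [ec-dyl09026app04] master_left [...], reason [failed to ping, tried [3] times]",
]

stamps, gaps = master_left_gaps(sample)
for later, gap in zip(stamps[1:], gaps):
    print(f"{later}  {gap:.1f} min after the previous master_left")
```

On a full log file, a long run of gaps close to 60 minutes would support the idle-timeout theory, while irregular gaps would point somewhere else.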

On the master node, here are the logs from the same time, for example at 5am:

[2015-12-23 05:03:43,903][WARN ][cluster.action.shard     ] [ec-rdl04910app06] [topbeat-2015.12.
SendRequestTransportException[[ec-dyl09026app04][10.35.132.143:9300][indices:data/write/bulk[s][
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:323)
  Caused by: NodeNotConnectedException[[ec-dyl09026app04][10.35.132.143:9300] Node not connected]
        at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:1096
[2015-12-23 05:03:44,779][WARN ][cluster.action.shard     ] [ec-rdl04910app06] [logstash-vasfulf
SendRequestTransportException[[ec-dyl09026app04][10.35.132.143:9300][indices:data/write/bulk[s][
Caused by: NodeNotConnectedException[[ec-dyl09026app04][10.35.132.143:9300] Node not connected]
[2015-12-23 05:04:12,404][INFO ][cluster.service          ] [ec-rdl04910app06] removed {{ec-dyl0
[2015-12-23 05:04:47,260][INFO ][cluster.service          ] [ec-rdl04910app06] added {{ec-dyl090

Same at 6am, only with another index having the problem:

[2015-12-23 06:04:53,543][WARN ][cluster.action.shard     ] [ec-rdl04910app06] [topbeat-2015.12.
SendRequestTransportException[[ec-dyl09026app02][10.35.76.37:9300][indices:data/write/bulk[s][r]
Caused by: NodeNotConnectedException[[ec-dyl09026app02][10.35.76.37:9300] Node not connected]
[2015-12-23 06:07:04,900][INFO ][cluster.service          ] [ec-rdl04910app06] removed {{ec-dyl0
[2015-12-23 06:07:04,987][DEBUG][action.admin.cluster.node.info] [ec-rdl04910app06] failed to ex
SendRequestTransportException[[ec-dyl09026app04][10.35.132.143:9300][cluster:monitor/nodes/info[
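To see which nodes the master loses its connection to, the NodeNotConnectedException lines can be counted per node. Again just a sketch: the helper name count_disconnected_nodes is hypothetical, and the regex assumes the "[[node][ip:port] Node not connected]" shape seen in the master logs above.

```python
import re
from collections import Counter

# Captures the node name and transport address from a NodeNotConnectedException line.
NOT_CONNECTED = re.compile(r"NodeNotConnectedException\[\[([^\]]+)\]\[([^\]]+)\]")

def count_disconnected_nodes(lines):
    """Count NodeNotConnectedException occurrences per (node, address)."""
    counts = Counter()
    for line in lines:
        m = NOT_CONNECTED.search(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts

# The three "Caused by" lines quoted in the master logs above.
sample_master = [
    "  Caused by: NodeNotConnectedException[[ec-dyl09026app04][10.35.132.143:9300] Node not connected]",
    "Caused by: NodeNotConnectedException[[ec-dyl09026app04][10.35.132.143:9300] Node not connected]",
    "Caused by: NodeNotConnectedException[[ec-dyl09026app02][10.35.76.37:9300] Node not connected]",
]

counts = count_disconnected_nodes(sample_master)
for (node, address), n in counts.most_common():
    print(node, address, n)
```

If every node in the other data center shows up here around the same minute each hour, that would fit with something between the data centers cutting the connections rather than a problem on a single node.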
