Node disconnects every hour

Omar_Al_Zabir · December 14, 2015, 8:34pm

I see nodes are disconnecting almost regularly every hour:

[2015-12-14 11:31:19,677][INFO ][cluster.service          ] [ec-dyl-3] removed {{ec-rdl-2}{sNyCgx2HQUuJEJ8jb_d9Yg}{10.187.147.59}{10.187.147.59:9300},}, reason: zen-disco-node_failed({ec-rdl-2}{sNyCgx2HQUuJEJ8jb_d9Yg}{10.187.147.59}{10.187.147.59:9300}), reason transport disconnected
[2015-12-14 11:31:23,637][INFO ][cluster.service          ] [ec-dyl-3] added {{ec-rdl-2}{sNyCgx2HQUuJEJ8jb_d9Yg}{10.187.147.59}{10.187.147.59:9300},}, reason: zen-disco-join(join from node[{ec-rdl-2}{sNyCgx2HQUuJEJ8jb_d9Yg}{10.187.147.59}{10.187.147.59:9300}])
[2015-12-14 12:39:27,642][INFO ][cluster.service          ] [ec-dyl-3] removed {{ec-rdl-2}{sNyCgx2HQUuJEJ8jb_d9Yg}{10.187.147.59}{10.187.147.59:9300},}, reason: zen-disco-node_failed({ec-rdl-2}{sNyCgx2HQUuJEJ8jb_d9Yg}{10.187.147.59}{10.187.147.59:9300}), reason transport disconnected
[2015-12-14 19:27:54,625][INFO ][cluster.service          ] [ec-dyl-3] removed {{ec-rdl-2}{sNyCgx2HQUuJEJ8jb_d9Yg}{10.187.147.59}{10.187.147.59:9300},}, reason: zen-disco-node_failed({ec-rdl-2}{sNyCgx2HQUuJEJ8jb_d9Yg}{10.187.147.59}{10.187.147.59:9300}), reason transport disconnected

On the ec-rdl-2 node's log, I see a lot of master left error:

[2015-12-14 01:14:35,475][INFO ][discovery.zen            ] [ec-rdl-2] master_left [{ec-dyl-3}{ezyh-vmeTYegJsYs1417XQ}{10.35.76.37}{10.35.76.37:9300}], reason [transport disconnected]
[2015-12-14 01:14:35,476][WARN ][discovery.zen            ] [ec-rdl-2] master left (reason = transport disconnected), current nodes: {{ec-rdl-2}{sNyCgx2HQUuJEJ8jb_d9Yg}{10.187.147.59}{10.187.147.59:9300},}
[2015-12-14 01:14:35,477][INFO ][cluster.service          ] [ec-rdl-2] removed {{ec-dyl-3}{ezyh-vmeTYegJsYs1417XQ}{10.35.76.37}{10.35.76.37:9300},}, reason: zen-disco-master_failed ({ec-dyl-3}{ezyh-vmeTYegJsYs1417XQ}{10.35.76.37}{10.35.76.37:9300})
[2015-12-14 01:14:38,516][INFO ][cluster.service          ] [ec-rdl-2] detected_master {ec-dyl-3}{ezyh-vmeTYegJsYs1417XQ}{10.35.76.37}{10.35.76.37:9300}, added {{ec-dyl-3}{ezyh-vmeTYegJsYs1417XQ}{10.35.76.37}{10.35.76.37:9300},}, reason: zen-disco-receive(from master [{ec-dyl-3}{ezyh-vmeTYegJsYs1417XQ}{10.35.76.37}{10.35.76.37:9300}])
[2015-12-14 01:29:37,366][INFO ][discovery.zen            ] [ec-rdl-2] master_left [{ec-dyl-3}{ezyh-vmeTYegJsYs1417XQ}{10.35.76.37}{10.35.76.37:9300}], reason [failed to ping, tried [3] times, each with  maximum [30s] timeout]
[2015-12-14 01:29:37,367][WARN ][discovery.zen            ] [ec-rdl-2] master left (reason = failed to ping, tried [3] times, each with  maximum [30s] timeout), current nodes: {{ec-rdl-2}{sNyCgx2HQUuJEJ8jb_d9Yg}{10.187.147.59}{10.187.147.59:9300},}
[2015-12-14 01:29:37,367][INFO ][cluster.service          ] [ec-rdl-2] removed {{ec-dyl-3}{ezyh-vmeTYegJsYs1417XQ}{10.35.76.37}{10.35.76.37:9300},}, reason: zen-disco-master_failed ({ec-dyl-3}{ezyh-vmeTYegJsYs1417XQ}{10.35.76.37}{10.35.76.37:9300})
[2015-12-14 01:29:40,443][INFO ][cluster.service          ] [ec-rdl-2] detected_master {ec-dyl-3}{ezyh-vmeTYegJsYs1417XQ}{10.35.76.37}{10.35.76.37:9

warkolm · December 15, 2015, 1:17am

And are you asking for assistance here, if so what would you like us to do?
It's always good to be explicit

Omar_Al_Zabir · December 15, 2015, 9:28am

Sorry, just asking for help, how do I go about find the root cause?

Christian_Dahlqvist · December 15, 2015, 9:34am

It would be helpful if you provided some additional information. Which version of Elasticsearch are you using? How is the cluster configured? What does your cluster topology look like?

mainec · December 15, 2015, 9:35am

Without having a strong ops background personally: I'd start with reading the logs to see if there's anything suspicious like you already did. From this:

I'd bias my further search towards having a closer look at the network setup. I'd ask question like how are instances linked to each other, are there other periodic network intensive jobs running, etc.

Omar_Al_Zabir · December 23, 2015, 1:51pm

I have found some more clue. I see that every hour, the master node gets dropped. Immediately after that, I get the the translog cannot be deleted exception. For example:

[2015-12-23 05:03:41,890][INFO ][discovery.zen            ] [ec-dyl09026app04] master_left [{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw}, reason [transport disconnected]
[2015-12-23 05:03:41,892][WARN ][discovery.zen            ] [ec-dyl09026app04] master left (reason = transport disconnected), current nodes: {{ec-
[2015-12-23 05:03:41,892][INFO ][cluster.service          ] [ec-dyl09026app04] removed {{ec-rdl04910app06},}, reason: zen-disco-master_failed ({ec-rdl04910app06}
[2015-12-23 05:03:43,932][INFO ][cluster.service          ] [ec-dyl09026app04] detected_master {ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw}{ added {{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw},}, 
[2015-12-23 05:04:12,441][INFO ][cluster.service          ] [ec-dyl09026app04] removed {{ec-dyl09026app02}{jgY3MDv_TI6I6gbccdPi1Q},}, reason: zen-disco-receive(from master [{ec-rdl04910app06}
[2015-12-23 05:04:17,234][WARN ][index.translog           ] [ec-dyl09026app04] [topbeat-2015.12.23][1] failed to delete temp file /opt/softwares/elasticsearch-2.1.0/data/ec-cluster/nodes/0/indices/topbeat-2015.12.23/1/translog/translog-7465629763806507405.tlog

This is at 5:03. Then at 6:03, same story:

[2015-12-23 06:03:43,619][INFO ][discovery.zen            ] [ec-dyl09026app04] master_left [{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw}], reason [transport disconnected]
[2015-12-23 06:03:43,620][WARN ][discovery.zen            ] [ec-dyl09026app04] master left (reason = transport disconnected), current nodes: {{ec-dyl09026app02}{jgY3MDv_TI6I6gbccdPi1Q}{10.35.76.37}{10.35.76.37:9300},{ec-dyl09026app04}
[2015-12-23 06:03:43,621][INFO ][cluster.service          ] [ec-dyl09026app04] removed {{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw} reason: zen-disco-master_failed ({ec-rdl04910app06}
[2015-12-23 06:03:59,567][INFO ][rest.suppressed          ] /_bulk Params: {}
ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];]

Then at 7, same problem:

[2015-12-23 06:07:06,407][INFO ][discovery.zen            ] [ec-dyl09026app04] master_left [{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw}], reason [failed to ping, tried [3] times, each with  maximum [30s] timeout]
[2015-12-23 06:07:06,408][WARN ][discovery.zen            ] [ec-dyl09026app04] master left (reason = failed to ping, tried [3] times, each with  maximum [30s] timeout), current nodes: {{ec-dyl09026app02}{jgY3MDv_TI6I6gbccdPi1Q},{ec-dyl09026app04}{p2QuXlyDR6CNvykHSeD2wA}
[2015-12-23 06:07:06,409][INFO ][cluster.service          ] [ec-dyl09026app04] removed {{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw},}, reason: zen-disco-master_failed ({ec-rdl04910app06}

The master node is across data center. So, I assume some firewall is dropping connection after 60m of inactivity/activity. Looks like the zen discover thing keeps a connection open and that connection gets dropped. Then the cluster thinks the master node has left.

On the master node, here's are the logs at that time, for ex at 5am:

[2015-12-23 05:03:43,903][WARN ][cluster.action.shard     ] [ec-rdl04910app06] [topbeat-2015.12.
SendRequestTransportException[[ec-dyl09026app04][10.35.132.143:9300][indices:data/write/bulk[s][
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:323)
  Caused by: NodeNotConnectedException[[ec-dyl09026app04][10.35.132.143:9300] Node not connected]
        at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:1096
[2015-12-23 05:03:44,779][WARN ][cluster.action.shard     ] [ec-rdl04910app06] [logstash-vasfulf
SendRequestTransportException[[ec-dyl09026app04][10.35.132.143:9300][indices:data/write/bulk[s][
Caused by: NodeNotConnectedException[[ec-dyl09026app04][10.35.132.143:9300] Node not connected]
[2015-12-23 05:04:12,404][INFO ][cluster.service          ] [ec-rdl04910app06] removed {{ec-dyl0
[2015-12-23 05:04:47,260][INFO ][cluster.service          ] [ec-rdl04910app06] added {{ec-dyl090

Same at 6am, only another index having problem:

[2015-12-23 06:04:53,543][WARN ][cluster.action.shard     ] [ec-rdl04910app06] [topbeat-2015.12.
SendRequestTransportException[[ec-dyl09026app02][10.35.76.37:9300][indices:data/write/bulk[s][r]
Caused by: NodeNotConnectedException[[ec-dyl09026app02][10.35.76.37:9300] Node not connected]
[2015-12-23 06:07:04,900][INFO ][cluster.service          ] [ec-rdl04910app06] removed {{ec-dyl0
[2015-12-23 06:07:04,987][DEBUG][action.admin.cluster.node.info] [ec-rdl04910app06] failed to ex
SendRequestTransportException[[ec-dyl09026app04][10.35.132.143:9300][cluster:monitor/nodes/info[

warkolm · December 23, 2015, 8:31pm

This is why we don't support cross DC clusters

Omar_Al_Zabir · December 24, 2015, 11:46am

I see. Is there any other way I can deploy a geographically resilient cluster?

I have made the cross DB node master=false. Still I see the cluster goes into red as the node leaves the cluster every hour and rejoins.

warkolm · December 24, 2015, 10:15pm

Check out https://www.elastic.co/blog/clustering_across_multiple_data_centers

Omar_Al_Zabir · December 25, 2015, 4:23pm

I see that the best way to make geographically distributed ES cluster is to actually setup independent clusters on each data center and then write the same data on each cluster.

I am thinking how best to do it. Can we get beats to write to multiple logstash at the same time? Not the current load balancing, since I need the same data to be written on each datacenter.

Of if I could get beats to write to kafka then kafka can commit the same events to a logstash in each DC.

Omar_Al_Zabir · December 27, 2015, 7:37pm

Now I am having this problem in the same data center. I have two servers next to each other. Each of them sees the other leaving randomly throughout the day.

From Server1 log:

[2015-12-27 19:05:56,978][INFO ][discovery.zen            ] [ec-dyl09026app04] master_left [{ec-dyl09026app02}{LfX2UgpuSvqb2oGv36YXhA}{10.35.76.37}{10.35.76.37:9300}], reason [transport disconnected]
[2015-12-27 19:05:56,981][WARN ][discovery.zen            ] [ec-dyl09026app04] master left (reason = transport disconnected), current nodes: {{ec-rdl04910app06}{DFwfJbyFR1a_-qLSrphh-g}{10.187.147.59}{10.187.147.59:9300}{data=false, master=false},{ec-dyl09026app04}{XxiThzzRQbqXzQTWmA2HgQ}{10.35.132.143}{10.35.132.143:9300},}
[2015-12-27 19:05:56,982][INFO ][cluster.service          ] [ec-dyl09026app04] removed {{ec-dyl09026app02}{LfX2UgpuSvqb2oGv36YXhA}{10.35.76.37}{10.35.76.37:9300},}, reason: zen-disco-master_failed ({ec-dyl09026app02}{LfX2UgpuSvqb2oGv36YXhA}{10.35.76.37}{10.35.76.37:9300})
[2015-12-27 19:06:00,033][INFO ][cluster.service          ] [ec-dyl09026app04] detected_master {ec-dyl09026app02}{LfX2UgpuSvqb2oGv36YXhA}{10.35.76.37}{10.35.76.37:9300}, added {{ec-dyl09026app02}{LfX2UgpuSvqb2oGv36YXhA}{10.35.76.37}{10.35.76.37:9300},}, reason: zen-disco-receive(from master [{ec-dyl09026app02}{LfX2UgpuSvqb2oGv36YXhA}{10.35.76.37}{10.35.76.37:9300}])
[2015-12-27 19:25:45,324][INFO ][discovery.zen            ] [ec-dyl09026app04] master_left [{ec-dyl09026app02}{LfX2UgpuSvqb2oGv36YXhA}{10.35.76.37}{10.35.76.37:9300}], reason [failed to ping, tried [5] times, each with  maximum [1m] timeout]
[2015-12-27 19:25:45,325][WARN ][discovery.zen            ] [ec-dyl09026app04] master left (reason = failed to ping, tried [5] times, each with  maximum [1m] timeout), current nodes: {{ec-rdl04910app06}{DFwfJbyFR1a_-qLSrphh-g}{10.187.147.59}{10.187.147.59:9300}{data=false, master=false},{ec-dyl09026app04}{XxiThzzRQbqXzQTWmA2HgQ}{10.35.132.143}{10.35.132.143:9300},}

On server2 log, I see no entries around 19:05. But I do see at 19:25.

[2015-12-27 19:25:31,877][INFO ][cluster.service          ] [ec-dyl09026app02] removed {{ec-dyl09026app04}{XxiThzzRQbqXzQTWmA2HgQ}{10.35.132.143}{10.35.132.143:9300},}, reason: zen-disco-node_failed({ec-dyl09026app04}{XxiThzzRQbqXzQTWmA2HgQ}{10.35.132.143}{10.35.132.143:9300}), reason transport disconnected
[2015-12-27 19:25:48,368][INFO ][cluster.service          ] [ec-dyl09026app02] added {{ec-dyl09026app04}{XxiThzzRQbqXzQTWmA2HgQ}{10.35.132.143}{10.35.132.143:9300},}, reason: zen-disco-join(join from node[{ec-dyl09026app04}{XxiThzzRQbqXzQTWmA2HgQ}{10.35.132.143}{10.35.132.143:9300}])
[2015-12-27 19:31:03,536][INFO ][cluster.service          ] [ec-dyl09026app02] removed {{ec-dyl09026app04}{XxiThzzRQbqXzQTWmA2HgQ}{10.35.132.143}{10.35.132.143:9300},}, reason: zen-disco-node_failed({ec-dyl09026app04}{XxiThzzRQbqXzQTWmA2HgQ}{10.35.132.143}{10.35.132.143:9300}), reason transport disconnected
[2015-12-27 19:31:26,966][INFO ][cluster.service          ] [ec-dyl09026app02] added {{ec-dyl09026app04}{_mMM8BqQS2SwVrWMxVQ02g}{10.35.132.143}{10.35.132.143:9300},}, reason: zen-disco-join(join from node[{ec-dyl09026app04}{_mMM8BqQS2SwVrWMxVQ02g}{10.35.132.143}{10.35.132.143:9300}])

Both servers were up and running at that time. I was actively performing queries on the servers.

My config is exactly same on both servers:

cluster.name: ec-cluster
node.name: ec-dyl09026app02
bootstrap.mlockall: true
network.host: 0.0.0.0
http.port: 9200
discovery.zen.ping.unicast.hosts: ["10.35.132.143","10.35.76.37"]
discovery.zen.minimum_master_nodes: 1
discovery.zen.no_master_block: write
discovery.zen.fd.ping_interval: 15s
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 5



threadpool:
  index:
    type: fixed
    size: 30
    queue_size: 1000
  bulk:
    type: fixed
    size: 30
    queue_size: 1000
  search:
    type: fixed
    size: 100
    queue_size: 200
  get:
    type: fixed
    size: 100
    queue_size: 200

index.number_of_replicas: 1

Please help!

warkolm · December 27, 2015, 8:04pm

It still sounds like a firewall issue.
Can you ask your network admin if the firewall handles long-lived connections?

Christian_Dahlqvist · December 27, 2015, 8:07pm

If you have 2 nodes in the cluster, you should set discovery.zen.minimum_master_nodes to 2 instead of 1 in order to avoid the risk of split-brain scenarios.

Omar_Al_Zabir · December 28, 2015, 9:42am

Hi Mark, thanks, but even if firewall drops a long lived connection, it should re-establish the connection right away, no? It feels like the zen discovery thing does not re-establish connection to do the ping. That's what I believe is causing all my grief.

warkolm · December 28, 2015, 10:13am

It does though, it removes and then re-adds.
What version are you on?

Omar_Al_Zabir · December 28, 2015, 10:16am

ES 2.1.0

When you say it removed and re-adds, does it mean it drops the node and re-adds the node, making many shards go unassigned and have to be assigned again?

This is exactly what is happening. I see cluster goes into yellow every hour and many shards become unassigned.

warkolm · December 28, 2015, 7:53pm

Yep. You could try setting the keep alive mentioned here https://www.elastic.co/guide/en/elasticsearch/reference/2.1/modules-network.html#tcp-settings

Omar_Al_Zabir · December 30, 2015, 12:03pm

Thanks for the pointer. I have found that linux keep-alive settings causes the problem. The default keep alive is too long. Once I have set this settings in sysctl, no more node drop problem anymore:

# Reduce keep alive values to increase frequency of keep alive packets sent across network in order to prevent ElasticSearch nodes losing connection between each other
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 20

Topic		Replies	Views
Frequent disconnects between nodes Elasticsearch	13	2337	July 6, 2017
Nodes randomly disconnected from the ES cluster Elasticsearch	10	7304	November 4, 2022
Very regular disconnect and recover - every 2 hours Elasticsearch	8	1225	July 6, 2017
Nodes being dropped from cluster Elasticsearch	8	1199	March 5, 2021
Max latency between nodes Elasticsearch	6	3167	July 6, 2017

Node disconnects every hour

Related topics