Node disconnects every hour

I see nodes disconnecting regularly, almost every hour:

[2015-12-14 11:31:19,677][INFO ][cluster.service          ] [ec-dyl-3] removed {{ec-rdl-2}{sNyCgx2HQUuJEJ8jb_d9Yg}{10.187.147.59}{10.187.147.59:9300},}, reason: zen-disco-node_failed({ec-rdl-2}{sNyCgx2HQUuJEJ8jb_d9Yg}{10.187.147.59}{10.187.147.59:9300}), reason transport disconnected
[2015-12-14 11:31:23,637][INFO ][cluster.service          ] [ec-dyl-3] added {{ec-rdl-2}{sNyCgx2HQUuJEJ8jb_d9Yg}{10.187.147.59}{10.187.147.59:9300},}, reason: zen-disco-join(join from node[{ec-rdl-2}{sNyCgx2HQUuJEJ8jb_d9Yg}{10.187.147.59}{10.187.147.59:9300}])
[2015-12-14 12:39:27,642][INFO ][cluster.service          ] [ec-dyl-3] removed {{ec-rdl-2}{sNyCgx2HQUuJEJ8jb_d9Yg}{10.187.147.59}{10.187.147.59:9300},}, reason: zen-disco-node_failed({ec-rdl-2}{sNyCgx2HQUuJEJ8jb_d9Yg}{10.187.147.59}{10.187.147.59:9300}), reason transport disconnected
[2015-12-14 19:27:54,625][INFO ][cluster.service          ] [ec-dyl-3] removed {{ec-rdl-2}{sNyCgx2HQUuJEJ8jb_d9Yg}{10.187.147.59}{10.187.147.59:9300},}, reason: zen-disco-node_failed({ec-rdl-2}{sNyCgx2HQUuJEJ8jb_d9Yg}{10.187.147.59}{10.187.147.59:9300}), reason transport disconnected

In the ec-rdl-2 node's log, I see a lot of master_left errors:

[2015-12-14 01:14:35,475][INFO ][discovery.zen            ] [ec-rdl-2] master_left [{ec-dyl-3}{ezyh-vmeTYegJsYs1417XQ}{10.35.76.37}{10.35.76.37:9300}], reason [transport disconnected]
[2015-12-14 01:14:35,476][WARN ][discovery.zen            ] [ec-rdl-2] master left (reason = transport disconnected), current nodes: {{ec-rdl-2}{sNyCgx2HQUuJEJ8jb_d9Yg}{10.187.147.59}{10.187.147.59:9300},}
[2015-12-14 01:14:35,477][INFO ][cluster.service          ] [ec-rdl-2] removed {{ec-dyl-3}{ezyh-vmeTYegJsYs1417XQ}{10.35.76.37}{10.35.76.37:9300},}, reason: zen-disco-master_failed ({ec-dyl-3}{ezyh-vmeTYegJsYs1417XQ}{10.35.76.37}{10.35.76.37:9300})
[2015-12-14 01:14:38,516][INFO ][cluster.service          ] [ec-rdl-2] detected_master {ec-dyl-3}{ezyh-vmeTYegJsYs1417XQ}{10.35.76.37}{10.35.76.37:9300}, added {{ec-dyl-3}{ezyh-vmeTYegJsYs1417XQ}{10.35.76.37}{10.35.76.37:9300},}, reason: zen-disco-receive(from master [{ec-dyl-3}{ezyh-vmeTYegJsYs1417XQ}{10.35.76.37}{10.35.76.37:9300}])
[2015-12-14 01:29:37,366][INFO ][discovery.zen            ] [ec-rdl-2] master_left [{ec-dyl-3}{ezyh-vmeTYegJsYs1417XQ}{10.35.76.37}{10.35.76.37:9300}], reason [failed to ping, tried [3] times, each with  maximum [30s] timeout]
[2015-12-14 01:29:37,367][WARN ][discovery.zen            ] [ec-rdl-2] master left (reason = failed to ping, tried [3] times, each with  maximum [30s] timeout), current nodes: {{ec-rdl-2}{sNyCgx2HQUuJEJ8jb_d9Yg}{10.187.147.59}{10.187.147.59:9300},}
[2015-12-14 01:29:37,367][INFO ][cluster.service          ] [ec-rdl-2] removed {{ec-dyl-3}{ezyh-vmeTYegJsYs1417XQ}{10.35.76.37}{10.35.76.37:9300},}, reason: zen-disco-master_failed ({ec-dyl-3}{ezyh-vmeTYegJsYs1417XQ}{10.35.76.37}{10.35.76.37:9300})
[2015-12-14 01:29:40,443][INFO ][cluster.service          ] [ec-rdl-2] detected_master {ec-dyl-3}{ezyh-vmeTYegJsYs1417XQ}{10.35.76.37}{10.35.76.37:9

Are you asking for assistance here? If so, what would you like us to do?
It's always good to be explicit :slight_smile:

Sorry, just asking for help. How do I go about finding the root cause?

It would be helpful if you provided some additional information. Which version of Elasticsearch are you using? How is the cluster configured? What does your cluster topology look like?

Without having a strong ops background personally: I'd start by reading the logs to see if there's anything suspicious, which you've already done.

Based on what you posted, I'd bias my further search towards having a closer look at the network setup. I'd ask questions like: how are the instances linked to each other, are there other periodic network-intensive jobs running, etc.
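
A few rough starting points on each node, assuming fairly standard Linux hosts (the IP is just taken from the logs above):

# look for hourly jobs, since the disconnects line up with the hour
crontab -l
ls /etc/cron.hourly/

# basic reachability between the two nodes
ping -c 5 10.187.147.59

# trace the path on the transport port to spot firewalls/NAT in between
sudo traceroute -T -p 9300 10.187.147.59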

I have found some more clues. I see that every hour the master node gets dropped. Immediately after that, I get an exception saying the translog temp file cannot be deleted. For example:

[2015-12-23 05:03:41,890][INFO ][discovery.zen            ] [ec-dyl09026app04] master_left [{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw}, reason [transport disconnected]
[2015-12-23 05:03:41,892][WARN ][discovery.zen            ] [ec-dyl09026app04] master left (reason = transport disconnected), current nodes: {{ec-
[2015-12-23 05:03:41,892][INFO ][cluster.service          ] [ec-dyl09026app04] removed {{ec-rdl04910app06},}, reason: zen-disco-master_failed ({ec-rdl04910app06}
[2015-12-23 05:03:43,932][INFO ][cluster.service          ] [ec-dyl09026app04] detected_master {ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw}{ added {{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw},}, 
[2015-12-23 05:04:12,441][INFO ][cluster.service          ] [ec-dyl09026app04] removed {{ec-dyl09026app02}{jgY3MDv_TI6I6gbccdPi1Q},}, reason: zen-disco-receive(from master [{ec-rdl04910app06}
[2015-12-23 05:04:17,234][WARN ][index.translog           ] [ec-dyl09026app04] [topbeat-2015.12.23][1] failed to delete temp file /opt/softwares/elasticsearch-2.1.0/data/ec-cluster/nodes/0/indices/topbeat-2015.12.23/1/translog/translog-7465629763806507405.tlog

This is at 5:03. Then at 6:03, same story:

[2015-12-23 06:03:43,619][INFO ][discovery.zen            ] [ec-dyl09026app04] master_left [{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw}], reason [transport disconnected]
[2015-12-23 06:03:43,620][WARN ][discovery.zen            ] [ec-dyl09026app04] master left (reason = transport disconnected), current nodes: {{ec-dyl09026app02}{jgY3MDv_TI6I6gbccdPi1Q}{10.35.76.37}{10.35.76.37:9300},{ec-dyl09026app04}
[2015-12-23 06:03:43,621][INFO ][cluster.service          ] [ec-dyl09026app04] removed {{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw} reason: zen-disco-master_failed ({ec-rdl04910app06}
[2015-12-23 06:03:59,567][INFO ][rest.suppressed          ] /_bulk Params: {}
ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];]

Then at 7, same problem:

[2015-12-23 06:07:06,407][INFO ][discovery.zen            ] [ec-dyl09026app04] master_left [{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw}], reason [failed to ping, tried [3] times, each with  maximum [30s] timeout]
[2015-12-23 06:07:06,408][WARN ][discovery.zen            ] [ec-dyl09026app04] master left (reason = failed to ping, tried [3] times, each with  maximum [30s] timeout), current nodes: {{ec-dyl09026app02}{jgY3MDv_TI6I6gbccdPi1Q},{ec-dyl09026app04}{p2QuXlyDR6CNvykHSeD2wA}
[2015-12-23 06:07:06,409][INFO ][cluster.service          ] [ec-dyl09026app04] removed {{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw},}, reason: zen-disco-master_failed ({ec-rdl04910app06}

The master node is in another data center, so I assume some firewall is dropping the connection after 60 minutes of inactivity/activity. It looks like zen discovery keeps a connection open, that connection gets dropped, and then the cluster thinks the master node has left.
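
One way to check that theory would be to look at the established transport connections and whether TCP keepalives are being sent on them. A rough sketch, assuming ss is available and 9300 is the transport port:

# show established connections on the transport port, with timer info
ss -o state established '( dport = :9300 or sport = :9300 )'
# the timer:(keepalive,...) field shows when the next TCP keepalive fires;
# if that interval is longer than the firewall's idle timeout, the firewall
# will silently drop the connection in between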

On the master node, here are the logs at that time, for example at 5am:

[2015-12-23 05:03:43,903][WARN ][cluster.action.shard     ] [ec-rdl04910app06] [topbeat-2015.12.
SendRequestTransportException[[ec-dyl09026app04][10.35.132.143:9300][indices:data/write/bulk[s][
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:323)
  Caused by: NodeNotConnectedException[[ec-dyl09026app04][10.35.132.143:9300] Node not connected]
        at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:1096
[2015-12-23 05:03:44,779][WARN ][cluster.action.shard     ] [ec-rdl04910app06] [logstash-vasfulf
SendRequestTransportException[[ec-dyl09026app04][10.35.132.143:9300][indices:data/write/bulk[s][
Caused by: NodeNotConnectedException[[ec-dyl09026app04][10.35.132.143:9300] Node not connected]
[2015-12-23 05:04:12,404][INFO ][cluster.service          ] [ec-rdl04910app06] removed {{ec-dyl0
[2015-12-23 05:04:47,260][INFO ][cluster.service          ] [ec-rdl04910app06] added {{ec-dyl090

Same at 6am, only with a different index having the problem:

[2015-12-23 06:04:53,543][WARN ][cluster.action.shard     ] [ec-rdl04910app06] [topbeat-2015.12.
SendRequestTransportException[[ec-dyl09026app02][10.35.76.37:9300][indices:data/write/bulk[s][r]
Caused by: NodeNotConnectedException[[ec-dyl09026app02][10.35.76.37:9300] Node not connected]
[2015-12-23 06:07:04,900][INFO ][cluster.service          ] [ec-rdl04910app06] removed {{ec-dyl0
[2015-12-23 06:07:04,987][DEBUG][action.admin.cluster.node.info] [ec-rdl04910app06] failed to ex
SendRequestTransportException[[ec-dyl09026app04][10.35.132.143:9300][cluster:monitor/nodes/info[

This is why we don't support cross DC clusters :slight_smile:

I see. Is there any other way I can deploy a geographically resilient cluster?

I have made the cross-DC node master=false. Still, I see the cluster go red as the node leaves the cluster every hour and rejoins.

Check out https://www.elastic.co/blog/clustering_across_multiple_data_centers

I see that the best way to build a geographically distributed ES setup is to actually set up independent clusters in each data center and then write the same data to each cluster.

I am thinking about how best to do it. Can we get Beats to write to multiple Logstash instances at the same time? Not the current load balancing, since I need the same data to be written to each data center.

Or, if I could get Beats to write to Kafka, then Kafka could deliver the same events to a Logstash instance in each DC.
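
Another option I'm considering (just a sketch; the hostnames are placeholders) is a single Logstash pipeline with two elasticsearch outputs, one per DC, since every event is sent to every output:

input {
  beats {
    port => 5044
  }
}
output {
  elasticsearch {
    hosts => ["dc1-es01:9200"]   # cluster in data center 1 (placeholder)
  }
  elasticsearch {
    hosts => ["dc2-es01:9200"]   # cluster in data center 2 (placeholder)
  }
}

The caveat, as far as I understand, is that if one cluster becomes unreachable the whole pipeline can back up, which is why a queue like Kafka in front still seems attractive.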

Now I am having this problem in the same data center. I have two servers next to each other. Each of them sees the other leaving randomly throughout the day.

From Server1 log:

[2015-12-27 19:05:56,978][INFO ][discovery.zen            ] [ec-dyl09026app04] master_left [{ec-dyl09026app02}{LfX2UgpuSvqb2oGv36YXhA}{10.35.76.37}{10.35.76.37:9300}], reason [transport disconnected]
[2015-12-27 19:05:56,981][WARN ][discovery.zen            ] [ec-dyl09026app04] master left (reason = transport disconnected), current nodes: {{ec-rdl04910app06}{DFwfJbyFR1a_-qLSrphh-g}{10.187.147.59}{10.187.147.59:9300}{data=false, master=false},{ec-dyl09026app04}{XxiThzzRQbqXzQTWmA2HgQ}{10.35.132.143}{10.35.132.143:9300},}
[2015-12-27 19:05:56,982][INFO ][cluster.service          ] [ec-dyl09026app04] removed {{ec-dyl09026app02}{LfX2UgpuSvqb2oGv36YXhA}{10.35.76.37}{10.35.76.37:9300},}, reason: zen-disco-master_failed ({ec-dyl09026app02}{LfX2UgpuSvqb2oGv36YXhA}{10.35.76.37}{10.35.76.37:9300})
[2015-12-27 19:06:00,033][INFO ][cluster.service          ] [ec-dyl09026app04] detected_master {ec-dyl09026app02}{LfX2UgpuSvqb2oGv36YXhA}{10.35.76.37}{10.35.76.37:9300}, added {{ec-dyl09026app02}{LfX2UgpuSvqb2oGv36YXhA}{10.35.76.37}{10.35.76.37:9300},}, reason: zen-disco-receive(from master [{ec-dyl09026app02}{LfX2UgpuSvqb2oGv36YXhA}{10.35.76.37}{10.35.76.37:9300}])
[2015-12-27 19:25:45,324][INFO ][discovery.zen            ] [ec-dyl09026app04] master_left [{ec-dyl09026app02}{LfX2UgpuSvqb2oGv36YXhA}{10.35.76.37}{10.35.76.37:9300}], reason [failed to ping, tried [5] times, each with  maximum [1m] timeout]
[2015-12-27 19:25:45,325][WARN ][discovery.zen            ] [ec-dyl09026app04] master left (reason = failed to ping, tried [5] times, each with  maximum [1m] timeout), current nodes: {{ec-rdl04910app06}{DFwfJbyFR1a_-qLSrphh-g}{10.187.147.59}{10.187.147.59:9300}{data=false, master=false},{ec-dyl09026app04}{XxiThzzRQbqXzQTWmA2HgQ}{10.35.132.143}{10.35.132.143:9300},}

In the server2 log, I see no entries around 19:05, but I do see some at 19:25.

[2015-12-27 19:25:31,877][INFO ][cluster.service          ] [ec-dyl09026app02] removed {{ec-dyl09026app04}{XxiThzzRQbqXzQTWmA2HgQ}{10.35.132.143}{10.35.132.143:9300},}, reason: zen-disco-node_failed({ec-dyl09026app04}{XxiThzzRQbqXzQTWmA2HgQ}{10.35.132.143}{10.35.132.143:9300}), reason transport disconnected
[2015-12-27 19:25:48,368][INFO ][cluster.service          ] [ec-dyl09026app02] added {{ec-dyl09026app04}{XxiThzzRQbqXzQTWmA2HgQ}{10.35.132.143}{10.35.132.143:9300},}, reason: zen-disco-join(join from node[{ec-dyl09026app04}{XxiThzzRQbqXzQTWmA2HgQ}{10.35.132.143}{10.35.132.143:9300}])
[2015-12-27 19:31:03,536][INFO ][cluster.service          ] [ec-dyl09026app02] removed {{ec-dyl09026app04}{XxiThzzRQbqXzQTWmA2HgQ}{10.35.132.143}{10.35.132.143:9300},}, reason: zen-disco-node_failed({ec-dyl09026app04}{XxiThzzRQbqXzQTWmA2HgQ}{10.35.132.143}{10.35.132.143:9300}), reason transport disconnected
[2015-12-27 19:31:26,966][INFO ][cluster.service          ] [ec-dyl09026app02] added {{ec-dyl09026app04}{_mMM8BqQS2SwVrWMxVQ02g}{10.35.132.143}{10.35.132.143:9300},}, reason: zen-disco-join(join from node[{ec-dyl09026app04}{_mMM8BqQS2SwVrWMxVQ02g}{10.35.132.143}{10.35.132.143:9300}])

Both servers were up and running at that time. I was actively performing queries on the servers.

My config is exactly the same on both servers:

cluster.name: ec-cluster
node.name: ec-dyl09026app02
bootstrap.mlockall: true
network.host: 0.0.0.0
http.port: 9200
discovery.zen.ping.unicast.hosts: ["10.35.132.143","10.35.76.37"]
discovery.zen.minimum_master_nodes: 1
discovery.zen.no_master_block: write
discovery.zen.fd.ping_interval: 15s
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 5



threadpool:
  index:
    type: fixed
    size: 30
    queue_size: 1000
  bulk:
    type: fixed
    size: 30
    queue_size: 1000
  search:
    type: fixed
    size: 100
    queue_size: 200
  get:
    type: fixed
    size: 100
    queue_size: 200

index.number_of_replicas: 1

Please help!

It still sounds like a firewall issue.
Can you ask your network admin how the firewall handles long-lived connections?

If you have 2 nodes in the cluster, you should set discovery.zen.minimum_master_nodes to 2 instead of 1 in order to avoid the risk of split-brain scenarios.
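
For example, with two master-eligible nodes that is (2 / 2) + 1 = 2, so in elasticsearch.yml on both nodes:

# should be (number of master-eligible nodes / 2) + 1
discovery.zen.minimum_master_nodes: 2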

Hi Mark, thanks, but even if the firewall drops a long-lived connection, shouldn't the node re-establish the connection right away? It feels like zen discovery does not re-establish the connection to do the ping. That's what I believe is causing all my grief.

It does though; it removes and then re-adds the node.
What version are you on?

ES 2.1.0

When you say it removes and re-adds, do you mean it drops the node and re-adds it, making many shards go unassigned and then have to be assigned again?

This is exactly what is happening. I see the cluster go yellow every hour and many shards become unassigned.

Yep. You could try setting the keep-alive option mentioned here: https://www.elastic.co/guide/en/elasticsearch/reference/2.1/modules-network.html#tcp-settings
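
If I'm reading that page right, the relevant setting in elasticsearch.yml would be something like this (I believe it is already on by default, so the OS-level keepalive timing may be what actually needs tuning):

# turns on SO_KEEPALIVE for transport connections; how often keepalives are
# actually sent is controlled by the operating system
network.tcp.keep_alive: true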

Thanks for the pointer. I have found that the Linux keep-alive settings were causing the problem. The default keep-alive time is too long. Once I set these settings in sysctl, the node-drop problem went away:

# Reduce keep-alive values to increase the frequency of keep-alive packets sent across the network, so that Elasticsearch nodes don't lose their connections to each other
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 20
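
For anyone hitting the same thing: assuming the values are added to /etc/sysctl.conf, they can be applied without a reboot like this:

# reload /etc/sysctl.conf
sudo sysctl -p
# or set them directly on the running kernel
sudo sysctl -w net.ipv4.tcp_keepalive_time=600
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=60
sudo sysctl -w net.ipv4.tcp_keepalive_probes=20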