Node disconnecting randomly

Hi,
We can see that nodes are frequently disconnecting from the cluster. Below are the errors from the master and data nodes. We are using ELK version 5.3.1. Could you please help us prevent the node disconnection issue?

**Error from master node:**

```
[2021-02-22T12:38:16,153][INFO ][o.e.c.r.a.AllocationService] [elk-denmod-web] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{elk-denmod-1}{t7JmFe-iSn63CALYD-0lxw}{7aHU8PepTH-PAzPrnr-q8w}{168.124.25.140}{168.124.25.140:9300} transport disconnected]).
[2021-02-22T12:38:16,153][INFO ][o.e.c.s.ClusterService   ] [elk-denmod-web] removed {{elk-denmod-1}{t7JmFe-iSn63CALYD-0lxw}{7aHU8PepTH-PAzPrnr-q8w}{168.124.25.140}{168.124.25.140:9300},}, reason: zen-disco-node-failed({elk-denmod-1}{t7JmFe-iSn63CALYD-0lxw}{7aHU8PepTH-PAzPrnr-q8w}{168.124.25.140}{168.124.25.140:9300}), reason(transport disconnected)[{elk-denmod-1}{t7JmFe-iSn63CALYD-0lxw}{7aHU8PepTH-PAzPrnr-q8w}{168.124.25.140}{168.124.25.140:9300} transport disconnected]
[2021-02-22T12:38:18,197][INFO ][o.e.c.r.DelayedAllocationService] [elk-denmod-web] scheduling reroute for delayed shards in [57.9s] (44 delayed shards)
[2021-02-22T12:38:20,318][INFO ][o.e.c.s.ClusterService   ] [elk-denmod-web] added {{elk-denmod-1}{t7JmFe-iSn63CALYD-0lxw}{7aHU8PepTH-PAzPrnr-q8w}{168.124.25.140}{168.124.25.140:9300},}, reason: zen-disco-node-join[{elk-denmod-1}{t7JmFe-iSn63CALYD-0lxw}{7aHU8PepTH-PAzPrnr-q8w}{168.124.25.140}{168.124.25.140:9300}]
[2021-02-22T12:39:45,620][INFO ][o.e.c.r.a.AllocationService] [elk-denmod-web] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[en_m_54][2]] ...]).
[2021-02-22T12:41:57,612][WARN ][o.e.l.LicenseService     ] [elk-denmod-web]
```

**Error from data node:**
```
[2021-02-22T13:56:31,924][INFO ][o.e.d.z.ZenDiscovery     ] [elk-denmod-6] master_left [{elk-denmod-web}{sMFUmPkRQPiJPstaGBObwg}{SY08ahurSG2uH-TWIs4aJA}{xspw10f206w.pharma.aventis.com}{168.124.147.161:9300}], reason [transport disconnected]
[2021-02-22T13:56:31,924][WARN ][o.e.d.z.ZenDiscovery     ] [elk-denmod-6] master left (reason = transport disconnected), current nodes: nodes:
   {elk-denmod-6}{fU5vCMTTS46Pboz5zMXD0Q}{3Tomsar_ReuXxB8_4Ym0fA}{168.124.25.122}{168.124.25.122:9300}, local
   {elk-denmod-5}{yez29U5iQxWl2hh6Yh-_xg}{IKnp6uBtRGKlE9j3a70S9g}{168.124.54.142}{168.124.54.142:9300}
   {elk-denmod-4}{rS_6QTYGSiKTyXJnjvSdyw}{cMJFNWhxSG-DqTCuyKLoyQ}{168.124.170.244}{168.124.170.244:9300}
   {elk-denmod-1}{t7JmFe-iSn63CALYD-0lxw}{7aHU8PepTH-PAzPrnr-q8w}{168.124.25.140}{168.124.25.140:9300}
   {elk-denmod-3}{5w1hFgnvRlOxtQ0QrX3tKQ}{wh63T78fSmWyamXRZje8jg}{168.124.29.126}{168.124.29.126:9300}
   {elk-denmod-web}{sMFUmPkRQPiJPstaGBObwg}{SY08ahurSG2uH-TWIs4aJA}{xspw10f206w.pharma.aventis.com}{168.124.147.161:9300}, master
   {elk-denmod-2}{VnvJTaKIQIqrdzXSlPUxxQ}{LIW0GTJdTMei5nWYebyszg}{168.124.29.129}{168.124.29.129:9300}

[2021-02-22T13:56:52,376][WARN ][o.e.c.NodeConnectionsService] [elk-denmod-6] failed to connect to node {elk-denmod-web}{sMFUmPkRQPiJPstaGBObwg}{SY08ahurSG2uH-TWIs4aJA}{xspw10f206w.pharma.aventis.com}{168.124.147.161:9300} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [elk-denmod-web][168.124.147.161:9300] general node connection failure
    at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:519) ~[elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:460) ~[elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:314) ~[elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:301) ~[elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.cluster.NodeConnectionsService.validateNodeConnected(NodeConnectionsService.java:121) [elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.cluster.NodeConnectionsService.connectToNodes(NodeConnectionsService.java:87) [elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.cluster.service.ClusterService.publishAndApplyChanges(ClusterService.java:780) [elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:633) [elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:1117) [elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) [elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) [elasticsearch-5.3.1.jar:5.3.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.lang.IllegalStateException: handshake failed, channel already closed
    at org.elasticsearch.transport.TcpTransport.executeHandshake(TcpTransport.java:1549) ~[elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:502) ~[elasticsearch-5.3.1.jar:5.3.1]
    ... 14 more
[2021-02-22T13:56:52,423][INFO ][o.e.g.DanglingIndicesState] [elk-denmod-6] failed to send allocated dangled
org.elasticsearch.discovery.MasterNotDiscoveredException: no master to send allocate dangled request
    at org.elasticsearch.gateway.LocalAllocateDangledIndices.allocateDangled(LocalAllocateDangledIndices.java:84) ~[elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.gateway.DanglingIndicesState.allocateDanglingIndices(DanglingIndicesState.java:164) ~[elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.gateway.DanglingIndicesState.processDanglingIndices(DanglingIndicesState.java:82) ~[elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.gateway.DanglingIndicesState.clusterChanged(DanglingIndicesState.java:185) ~[elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.cluster.service.ClusterService.lambda$publishAndApplyChanges$11(ClusterService.java:824) ~[elasticsearch-5.3.1.jar:5.3.1]
    at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) [?:1.8.0_131]
    at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742) [?:1.8.0_131]
    at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) [?:1.8.0_131]
    at org.elasticsearch.cluster.service.ClusterService.publishAndApplyChanges(ClusterService.java:821) [elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:633) [elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:1117) [elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) [elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) [elasticsearch-5.3.1.jar:5.3.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
[2021-02-22T13:56:52,454][INFO ][o.e.d.z.ZenDiscovery     ] [elk-denmod-6] failed to send join request to master [{elk-denmod-web}{sMFUmPkRQPiJPstaGBObwg}{SY08ahurSG2uH-TWIs4aJA}{xspw10f206w.pharma.aventis.com}{168.124.147.161:9300}], reason [RemoteTransportException[[elk-denmod-web][168.124.147.161:9300][internal:discovery/zen/join]]; nested: IllegalStateException[failure when sending a validation request to node]; nested: NodeDisconnectedException[[elk-denmod-6][168.124.25.122:9300][internal:discovery/zen/join/validate] disconnected]; ]
[2021-02-22T13:56:56,120][INFO ][o.e.c.s.ClusterService   ] [elk-denmod-6] detected_master {elk-denmod-web}{sMFUmPkRQPiJPstaGBObwg}{SY08ahurSG2uH-TWIs4aJA}{xspw10f206w.pharma.aventis.com}{168.124.147.161:9300}, reason: zen-disco-receive(from master [master {elk-denmod-web}{sMFUmPkRQPiJPstaGBObwg}{SY08ahurSG2uH-TWIs4aJA}{xspw10f206w.pharma.aventis.com}{168.124.147.161:9300} committed version [68667]])
```

Can someone help me with this, please?

Elasticsearch 5.3.1 is very old and long past EOL, so I would recommend you upgrade. For someone to help troubleshoot this you probably need to provide a lot more details and context. How is your cluster configured and deployed? Are there any other clues in the logs around these times, e.g. frequent or slow GC? What load is the cluster under? What is the use case?

It would also help if you could provide the full output of the cluster stats API.
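For reference, that output comes from the cluster stats API; in Kibana Console (or via curl against HTTP port 9200) the request is:

```
GET /_cluster/stats?human&pretty
```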

We have one master node and six data nodes. Below are the Elasticsearch settings:

```yaml
cluster.name: elk-denmod
cluster.routing.allocation.disk.watermark.low: 95%
node.name: elk-denmod-web
bootstrap.memory_lock: true
network.host: [ELK1.pharma.aventis.com,_local_]
discovery.zen.ping.unicast.hosts: [ELK1.pharma.aventis.com, ELK2.pharma.aventis.com, ELK3.pharma.aventis.com, ELK4.pharma.aventis.com, ELK5.pharma.aventis.com, ELK6.pharma.aventis.com, ELK7.pharma.aventis.com, ELK8.pharma.aventis.com]
node.master: true
node.data: false
node.ingest: false
xpack.security.enabled: false
xpack.monitoring.enabled: false
```

Note: server names have been replaced with the ELK1–ELK8 numbering.

Sometimes it happens after a gap of 2 hours, and sometimes after a gap of 4 hours.

Below is the output of the cluster stats API:

{
  "_nodes" : {
    "total" : 7,
    "successful" : 7,
    "failed" : 0
  },
  "cluster_name" : "elk-denmod",
  "timestamp" : 1615276190452,
  "status" : "green",
  "indices" : {
    "count" : 26,
    "shards" : {
      "total" : 260,
      "primaries" : 130,
      "replication" : 1.0,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 12,
          "avg" : 10.0
        },
        "primaries" : {
          "min" : 1,
          "max" : 6,
          "avg" : 5.0
        },
        "replication" : {
          "min" : 1.0,
          "max" : 1.0,
          "avg" : 1.0
        }
      }
    },
    "docs" : {
      "count" : 670500630,
      "deleted" : 142205573
    },
    "store" : {
      "size" : "2.4tb",
      "size_in_bytes" : 2710981900986,
      "throttle_time" : "0s",
      "throttle_time_in_millis" : 0
    },
    "fielddata" : {
      "memory_size" : "0b",
      "memory_size_in_bytes" : 0,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "12.1mb",
      "memory_size_in_bytes" : 12750104,
      "total_count" : 884,
      "hit_count" : 450,
      "miss_count" : 434,
      "cache_size" : 120,
      "cache_count" : 120,
      "evictions" : 0
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 4801,
      "memory" : "7.4gb",
      "memory_in_bytes" : 7979402041,
      "terms_memory" : "6.5gb",
      "terms_memory_in_bytes" : 6990971223,
      "stored_fields_memory" : "425.5mb",
      "stored_fields_memory_in_bytes" : 446225384,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "184.1kb",
      "norms_memory_in_bytes" : 188608,
      "points_memory" : "481.2mb",
      "points_memory_in_bytes" : 504584294,
      "doc_values_memory" : "35.6mb",
      "doc_values_memory_in_bytes" : 37432532,
      "index_writer_memory" : "0b",
      "index_writer_memory_in_bytes" : 0,
      "version_map_memory" : "0b",
      "version_map_memory_in_bytes" : 0,
      "fixed_bit_set" : "0b",
      "fixed_bit_set_memory_in_bytes" : 0,
      "max_unsafe_auto_id_timestamp" : -1,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 7,
      "data" : 6,
      "coordinating_only" : 0,
      "master" : 1,
      "ingest" : 6
    },
    "versions" : [
      "5.3.1"
    ],
    "os" : {
      "available_processors" : 52,
      "allocated_processors" : 52,
      "names" : [
        {
          "name" : "Windows Server 2008 R2",
          "count" : 7
        }
      ],
      "mem" : {
        "total" : "104gb",
        "total_in_bytes" : 111669420032,
        "free" : "13.2gb",
        "free_in_bytes" : 14238105600,
        "used" : "90.7gb",
        "used_in_bytes" : 97431314432,
        "free_percent" : 13,
        "used_percent" : 87
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 0
      },
      "open_file_descriptors" : {
        "min" : -1,
        "max" : -1,
        "avg" : 0
      }
    },
    "jvm" : {
      "max_uptime" : "1.9h",
      "max_uptime_in_millis" : 7116306,
      "versions" : [
        {
          "version" : "1.8.0_131",
          "vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
          "vm_version" : "25.131-b11",
          "vm_vendor" : "Oracle Corporation",
          "count" : 7
        }
      ],
      "mem" : {
        "heap_used" : "17.4gb",
        "heap_used_in_bytes" : 18777602992,
        "heap_max" : "48.6gb",
        "heap_max_in_bytes" : 52281081856
      },
      "threads" : 653
    },
    "fs" : {
      "total" : "6tb",
      "total_in_bytes" : 6657177260032,
      "free" : "3.3tb",
      "free_in_bytes" : 3639399837696,
      "available" : "3.3tb",
      "available_in_bytes" : 3639399837696
    },
    "plugins" : [
      {
        "name" : "x-pack",
        "version" : "5.3.1",
        "description" : "Elasticsearch Expanded Pack Plugin",
        "classname" : "org.elasticsearch.xpack.XPackPlugin"
      }
    ],
    "network_types" : {
      "transport_types" : {
        "netty4" : 7
      },
      "http_types" : {
        "netty4" : 7
      }
    }
  }
}

How do I find the load on the cluster?

Can someone help me on this?

Please help me with this.

How many of your nodes are master eligible? What is minimum_master_nodes set to? Can you share logs around GC?

Two of the nodes are master-eligible.

We didn't set that.

What do you mean by GC?

This means that your cluster is misconfigured, which can lead to split-brain scenarios and data loss. You should make sure you have exactly three master-eligible nodes in the cluster (two is not good) and set discovery.zen.minimum_master_nodes to 2 to avoid split-brain scenarios.

By GC I mean garbage collection. Please check logs for reports of long and/or frequent GC.
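As a sketch, with three master-eligible nodes the 5.x setting would look like this in each master-eligible node's elasticsearch.yml (the value is the majority quorum, 3 / 2 + 1 = 2):

```yaml
# elasticsearch.yml on each master-eligible node (Elasticsearch 5.x, Zen discovery)
node.master: true
# quorum of master-eligible nodes: (3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2
```

In 5.x this setting is also dynamic, so it can be applied without a restart via `PUT _cluster/settings`.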

Is this node disconnection also related to these split-brain scenarios?

I do not know, but it could be. This is why I also asked about GC.

We found the reference link below for the same issue. Do the TCP keepalive settings below resolve the issue? If so, do we need to add these lines to the elasticsearch.yml file?

```
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 20
```
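(For context: keys with the `net.ipv4.` prefix are Linux kernel parameters, not Elasticsearch settings, so they do not belong in elasticsearch.yml. On a Linux host they would be applied roughly like this; note that your cluster stats show Windows Server 2008 R2, where these keys do not exist:)

```
# /etc/sysctl.d/90-keepalive.conf (Linux only; load with `sysctl --system`)
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 20
```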

I don't know how to check GC. Could you please guide me on how to check it?

I would recommend sticking to default timeout values. GC should be reported in the Elasticsearch logs so look there. It might also help if you describe how and where your cluster is deployed. What is the latency between the nodes in the cluster?
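In 5.x, long or frequent garbage collection is reported by the `JvmGcMonitorService` logger, so the quickest check is to search the node logs for it. A minimal sketch (the sample log line below is illustrative, not taken from your cluster; in practice you would point `grep`, or `findstr` on Windows, at the real log file):

```shell
# An illustrative GC report of the kind Elasticsearch 5.x writes via JvmGcMonitorService.
sample='[2021-02-22T13:55:10,101][WARN ][o.e.m.j.JvmGcMonitorService] [elk-denmod-6] [gc][old][1234][56] duration [12.3s], collections [1]/[12.9s]'

# Count matching GC report lines (replace the echoed sample with your log file).
printf '%s\n' "$sample" | grep -c 'JvmGcMonitorService'
```

WARN-level lines with long durations, or many of them in a short window, are the signal to look for.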

Do we just need to add these lines to the elasticsearch.yml file?

Could you please help us with the question above?

You really need to upgrade. That will probably solve most, if not all, of the issues you are seeing.
I'm not sure we can help more than that.

Otherwise, share the full Elasticsearch logs. Maybe we can find something useful in them.
You can share them on gist.github.com if they are too big for this forum.

We are in the process of migrating Elasticsearch to AWS. While migrating the ELK data from on-prem to the AWS cloud, we are getting disconnection errors, and because of that we are unable to migrate the data. Could you please suggest the best option for migrating the data to AWS? Please share documentation if any is available.

Are you going to run Elasticsearch by yourself on EC2 instances and manage that?

I'd highly recommend looking at Cloud by Elastic, also available if needed from AWS Marketplace.

Cloud by Elastic is one way to have access to all features, all managed by us. Think about everything that is already there, like Security, Monitoring, Reporting, SQL, Canvas, Maps UI, and Alerting, plus the built-in solutions named Observability, Security, and Enterprise Search, and what is coming next :slight_smile: ...

I'd suggest reading this guide: Migrating your Elasticsearch data | Elasticsearch Service Documentation | Elastic
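If you end up moving the data yourself rather than using the hosted migration path, one option that exists in 5.x is reindex-from-remote, where the destination cluster pulls documents from the source over HTTP. A rough sketch of the request (the hostname is a placeholder, the index name is taken from your logs as an example, and the remote host must first be listed in `reindex.remote.whitelist` in the destination cluster's elasticsearch.yml):

```
POST _reindex
{
  "source": {
    "remote": { "host": "http://onprem-host:9200" },
    "index": "en_m_54"
  },
  "dest": { "index": "en_m_54" }
}
```

You would run one such request per index, after creating the destination indices with the mappings you want.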

Could you please help me with this? Do we need to add the configuration below to the elasticsearch.yml file?

```
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 20
```

Could you answer?

And also to this new question.

How?