Node disconnecting randomly

Hi,
We can see that nodes are frequently disconnecting from the cluster. Below are the errors from the master and data nodes. We are using ELK version 5.3.1. Could you please help us prevent the node disconnection issue?

**Error from master node:**

```
[2021-02-22T12:38:16,153][INFO ][o.e.c.r.a.AllocationService] [elk-denmod-web] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{elk-denmod-1}{t7JmFe-iSn63CALYD-0lxw}{7aHU8PepTH-PAzPrnr-q8w}{168.124.25.140}{168.124.25.140:9300} transport disconnected]).
[2021-02-22T12:38:16,153][INFO ][o.e.c.s.ClusterService   ] [elk-denmod-web] removed {{elk-denmod-1}{t7JmFe-iSn63CALYD-0lxw}{7aHU8PepTH-PAzPrnr-q8w}{168.124.25.140}{168.124.25.140:9300},}, reason: zen-disco-node-failed({elk-denmod-1}{t7JmFe-iSn63CALYD-0lxw}{7aHU8PepTH-PAzPrnr-q8w}{168.124.25.140}{168.124.25.140:9300}), reason(transport disconnected)[{elk-denmod-1}{t7JmFe-iSn63CALYD-0lxw}{7aHU8PepTH-PAzPrnr-q8w}{168.124.25.140}{168.124.25.140:9300} transport disconnected]
[2021-02-22T12:38:18,197][INFO ][o.e.c.r.DelayedAllocationService] [elk-denmod-web] scheduling reroute for delayed shards in [57.9s] (44 delayed shards)
[2021-02-22T12:38:20,318][INFO ][o.e.c.s.ClusterService   ] [elk-denmod-web] added {{elk-denmod-1}{t7JmFe-iSn63CALYD-0lxw}{7aHU8PepTH-PAzPrnr-q8w}{168.124.25.140}{168.124.25.140:9300},}, reason: zen-disco-node-join[{elk-denmod-1}{t7JmFe-iSn63CALYD-0lxw}{7aHU8PepTH-PAzPrnr-q8w}{168.124.25.140}{168.124.25.140:9300}]
[2021-02-22T12:39:45,620][INFO ][o.e.c.r.a.AllocationService] [elk-denmod-web] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[en_m_54][2]] ...]).
[2021-02-22T12:41:57,612][WARN ][o.e.l.LicenseService     ] [elk-denmod-web]
```

**Error from data node:**
```
[2021-02-22T13:56:31,924][INFO ][o.e.d.z.ZenDiscovery     ] [elk-denmod-6] master_left [{elk-denmod-web}{sMFUmPkRQPiJPstaGBObwg}{SY08ahurSG2uH-TWIs4aJA}{xspw10f206w.pharma.aventis.com}{168.124.147.161:9300}], reason [transport disconnected]
[2021-02-22T13:56:31,924][WARN ][o.e.d.z.ZenDiscovery     ] [elk-denmod-6] master left (reason = transport disconnected), current nodes: nodes:
   {elk-denmod-6}{fU5vCMTTS46Pboz5zMXD0Q}{3Tomsar_ReuXxB8_4Ym0fA}{168.124.25.122}{168.124.25.122:9300}, local
   {elk-denmod-5}{yez29U5iQxWl2hh6Yh-_xg}{IKnp6uBtRGKlE9j3a70S9g}{168.124.54.142}{168.124.54.142:9300}
   {elk-denmod-4}{rS_6QTYGSiKTyXJnjvSdyw}{cMJFNWhxSG-DqTCuyKLoyQ}{168.124.170.244}{168.124.170.244:9300}
   {elk-denmod-1}{t7JmFe-iSn63CALYD-0lxw}{7aHU8PepTH-PAzPrnr-q8w}{168.124.25.140}{168.124.25.140:9300}
   {elk-denmod-3}{5w1hFgnvRlOxtQ0QrX3tKQ}{wh63T78fSmWyamXRZje8jg}{168.124.29.126}{168.124.29.126:9300}
   {elk-denmod-web}{sMFUmPkRQPiJPstaGBObwg}{SY08ahurSG2uH-TWIs4aJA}{xspw10f206w.pharma.aventis.com}{168.124.147.161:9300}, master
   {elk-denmod-2}{VnvJTaKIQIqrdzXSlPUxxQ}{LIW0GTJdTMei5nWYebyszg}{168.124.29.129}{168.124.29.129:9300}

[2021-02-22T13:56:52,376][WARN ][o.e.c.NodeConnectionsService] [elk-denmod-6] failed to connect to node {elk-denmod-web}{sMFUmPkRQPiJPstaGBObwg}{SY08ahurSG2uH-TWIs4aJA}{xspw10f206w.pharma.aventis.com}{168.124.147.161:9300} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [elk-denmod-web][168.124.147.161:9300] general node connection failure
    at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:519) ~[elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:460) ~[elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:314) ~[elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:301) ~[elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.cluster.NodeConnectionsService.validateNodeConnected(NodeConnectionsService.java:121) [elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.cluster.NodeConnectionsService.connectToNodes(NodeConnectionsService.java:87) [elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.cluster.service.ClusterService.publishAndApplyChanges(ClusterService.java:780) [elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:633) [elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:1117) [elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) [elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) [elasticsearch-5.3.1.jar:5.3.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.lang.IllegalStateException: handshake failed, channel already closed
    at org.elasticsearch.transport.TcpTransport.executeHandshake(TcpTransport.java:1549) ~[elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:502) ~[elasticsearch-5.3.1.jar:5.3.1]
    ... 14 more
[2021-02-22T13:56:52,423][INFO ][o.e.g.DanglingIndicesState] [elk-denmod-6] failed to send allocated dangled
org.elasticsearch.discovery.MasterNotDiscoveredException: no master to send allocate dangled request
    at org.elasticsearch.gateway.LocalAllocateDangledIndices.allocateDangled(LocalAllocateDangledIndices.java:84) ~[elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.gateway.DanglingIndicesState.allocateDanglingIndices(DanglingIndicesState.java:164) ~[elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.gateway.DanglingIndicesState.processDanglingIndices(DanglingIndicesState.java:82) ~[elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.gateway.DanglingIndicesState.clusterChanged(DanglingIndicesState.java:185) ~[elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.cluster.service.ClusterService.lambda$publishAndApplyChanges$11(ClusterService.java:824) ~[elasticsearch-5.3.1.jar:5.3.1]
    at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) [?:1.8.0_131]
    at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742) [?:1.8.0_131]
    at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) [?:1.8.0_131]
    at org.elasticsearch.cluster.service.ClusterService.publishAndApplyChanges(ClusterService.java:821) [elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:633) [elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:1117) [elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) [elasticsearch-5.3.1.jar:5.3.1]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) [elasticsearch-5.3.1.jar:5.3.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
[2021-02-22T13:56:52,454][INFO ][o.e.d.z.ZenDiscovery     ] [elk-denmod-6] failed to send join request to master [{elk-denmod-web}{sMFUmPkRQPiJPstaGBObwg}{SY08ahurSG2uH-TWIs4aJA}{xspw10f206w.pharma.aventis.com}{168.124.147.161:9300}], reason [RemoteTransportException[[elk-denmod-web][168.124.147.161:9300][internal:discovery/zen/join]]; nested: IllegalStateException[failure when sending a validation request to node]; nested: NodeDisconnectedException[[elk-denmod-6][168.124.25.122:9300][internal:discovery/zen/join/validate] disconnected]; ]
[2021-02-22T13:56:56,120][INFO ][o.e.c.s.ClusterService   ] [elk-denmod-6] detected_master {elk-denmod-web}{sMFUmPkRQPiJPstaGBObwg}{SY08ahurSG2uH-TWIs4aJA}{xspw10f206w.pharma.aventis.com}{168.124.147.161:9300}, reason: zen-disco-receive(from master [master {elk-denmod-web}{sMFUmPkRQPiJPstaGBObwg}{SY08ahurSG2uH-TWIs4aJA}{xspw10f206w.pharma.aventis.com}{168.124.147.161:9300} committed version [68667]])
```

Can someone help me with this, please?

Elasticsearch 5.3.1 is very old and long past EOL, so I would recommend you upgrade. For someone to help troubleshoot this you probably need to provide a lot more details and context. How is your cluster configured and deployed? Are there any other clues in the logs around these times, e.g. frequent or slow GC? What load is the cluster under? What is the use case?

It would also help if you could provide the full output of the cluster stats API.
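For reference, that output comes from the cluster stats API; in Kibana Console (or via curl against HTTP port 9200) the request is:

```
GET /_cluster/stats?human&pretty
```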

We have one master node and six data nodes. Below are the Elasticsearch settings:

```yaml
cluster.name: elk-denmod
cluster.routing.allocation.disk.watermark.low: 95%
node.name: elk-denmod-web
bootstrap.memory_lock: true
network.host: [ELK1.pharma.aventis.com,_local_]
discovery.zen.ping.unicast.hosts: [ELK1.pharma.aventis.com, ELK2.pharma.aventis.com, ELK3.pharma.aventis.com, ELK4.pharma.aventis.com, ELK5.pharma.aventis.com, ELK6.pharma.aventis.com, ELK7.pharma.aventis.com, ELK8.pharma.aventis.com]
node.master: true
node.data: false
node.ingest: false
xpack.security.enabled: false
xpack.monitoring.enabled: false
```

Note: server names have been replaced with the ELK1–ELK8 numbering.

Sometimes it happens after a gap of 2 hours, and sometimes after a gap of 4 hours.

Below is the output of the cluster stats API:

{
  "_nodes" : {
    "total" : 7,
    "successful" : 7,
    "failed" : 0
  },
  "cluster_name" : "elk-denmod",
  "timestamp" : 1615276190452,
  "status" : "green",
  "indices" : {
    "count" : 26,
    "shards" : {
      "total" : 260,
      "primaries" : 130,
      "replication" : 1.0,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 12,
          "avg" : 10.0
        },
        "primaries" : {
          "min" : 1,
          "max" : 6,
          "avg" : 5.0
        },
        "replication" : {
          "min" : 1.0,
          "max" : 1.0,
          "avg" : 1.0
        }
      }
    },
    "docs" : {
      "count" : 670500630,
      "deleted" : 142205573
    },
    "store" : {
      "size" : "2.4tb",
      "size_in_bytes" : 2710981900986,
      "throttle_time" : "0s",
      "throttle_time_in_millis" : 0
    },
    "fielddata" : {
      "memory_size" : "0b",
      "memory_size_in_bytes" : 0,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "12.1mb",
      "memory_size_in_bytes" : 12750104,
      "total_count" : 884,
      "hit_count" : 450,
      "miss_count" : 434,
      "cache_size" : 120,
      "cache_count" : 120,
      "evictions" : 0
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 4801,
      "memory" : "7.4gb",
      "memory_in_bytes" : 7979402041,
      "terms_memory" : "6.5gb",
      "terms_memory_in_bytes" : 6990971223,
      "stored_fields_memory" : "425.5mb",
      "stored_fields_memory_in_bytes" : 446225384,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "184.1kb",
      "norms_memory_in_bytes" : 188608,
      "points_memory" : "481.2mb",
      "points_memory_in_bytes" : 504584294,
      "doc_values_memory" : "35.6mb",
      "doc_values_memory_in_bytes" : 37432532,
      "index_writer_memory" : "0b",
      "index_writer_memory_in_bytes" : 0,
      "version_map_memory" : "0b",
      "version_map_memory_in_bytes" : 0,
      "fixed_bit_set" : "0b",
      "fixed_bit_set_memory_in_bytes" : 0,
      "max_unsafe_auto_id_timestamp" : -1,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 7,
      "data" : 6,
      "coordinating_only" : 0,
      "master" : 1,
      "ingest" : 6
    },
    "versions" : [
      "5.3.1"
    ],
    "os" : {
      "available_processors" : 52,
      "allocated_processors" : 52,
      "names" : [
        {
          "name" : "Windows Server 2008 R2",
          "count" : 7
        }
      ],
      "mem" : {
        "total" : "104gb",
        "total_in_bytes" : 111669420032,
        "free" : "13.2gb",
        "free_in_bytes" : 14238105600,
        "used" : "90.7gb",
        "used_in_bytes" : 97431314432,
        "free_percent" : 13,
        "used_percent" : 87
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 0
      },
      "open_file_descriptors" : {
        "min" : -1,
        "max" : -1,
        "avg" : 0
      }
    },
    "jvm" : {
      "max_uptime" : "1.9h",
      "max_uptime_in_millis" : 7116306,
      "versions" : [
        {
          "version" : "1.8.0_131",
          "vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
          "vm_version" : "25.131-b11",
          "vm_vendor" : "Oracle Corporation",
          "count" : 7
        }
      ],
      "mem" : {
        "heap_used" : "17.4gb",
        "heap_used_in_bytes" : 18777602992,
        "heap_max" : "48.6gb",
        "heap_max_in_bytes" : 52281081856
      },
      "threads" : 653
    },
    "fs" : {
      "total" : "6tb",
      "total_in_bytes" : 6657177260032,
      "free" : "3.3tb",
      "free_in_bytes" : 3639399837696,
      "available" : "3.3tb",
      "available_in_bytes" : 3639399837696
    },
    "plugins" : [
      {
        "name" : "x-pack",
        "version" : "5.3.1",
        "description" : "Elasticsearch Expanded Pack Plugin",
        "classname" : "org.elasticsearch.xpack.XPackPlugin"
      }
    ],
    "network_types" : {
      "transport_types" : {
        "netty4" : 7
      },
      "http_types" : {
        "netty4" : 7
      }
    }
  }
}

How do I find the load on the cluster?

Can someone help me on this?

Please help me with this.

How many of your nodes are master eligible? What is minimum_master_nodes set to? Can you share logs around GC?

Two of the nodes are master-eligible.

We didn't set that.

What do you mean by GC?

This means that your cluster is misconfigured, which can lead to split-brain scenarios and data loss. You should make sure you have exactly three master-eligible nodes in the cluster (two is not good) and set discovery.zen.minimum_master_nodes to 2 to avoid split-brain scenarios.

By GC I mean garbage collection. Please check logs for reports of long and/or frequent GC.
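As a sketch, with three master-eligible nodes the 5.x setting would look like this in each master-eligible node's elasticsearch.yml (the value is the majority quorum, 3 / 2 + 1 = 2):

```yaml
# elasticsearch.yml on each master-eligible node (Elasticsearch 5.x, Zen discovery)
node.master: true
# quorum of master-eligible nodes: (3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2
```

In 5.x this setting is also dynamic, so it can be applied without a restart via `PUT _cluster/settings`.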

Is this node disconnection also related to these split-brain scenarios?

I do not know, but it could be. This is why I also asked about GC.

We found the reference link below for the same issue. Do the TCP keepalive settings below resolve the issue? If so, do we need to add these lines to the elasticsearch.yml file?

```
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 20
```
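(For context: keys with the `net.ipv4.` prefix are Linux kernel parameters, not Elasticsearch settings, so they do not belong in elasticsearch.yml. On a Linux host they would be applied roughly like this; note that your cluster stats show Windows Server 2008 R2, where these keys do not exist:)

```
# /etc/sysctl.d/90-keepalive.conf (Linux only; load with `sysctl --system`)
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 20
```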

I don't know how to check GC. Could you please guide me on how to check it?

I would recommend sticking to default timeout values. GC should be reported in the Elasticsearch logs so look there. It might also help if you describe how and where your cluster is deployed. What is the latency between the nodes in the cluster?
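In 5.x, long or frequent garbage collection is reported by the `JvmGcMonitorService` logger, so the quickest check is to search the node logs for it. A minimal sketch (the sample log line below is illustrative, not taken from your cluster; in practice you would point `grep`, or `findstr` on Windows, at the real log file):

```shell
# An illustrative GC report of the kind Elasticsearch 5.x writes via JvmGcMonitorService.
sample='[2021-02-22T13:55:10,101][WARN ][o.e.m.j.JvmGcMonitorService] [elk-denmod-6] [gc][old][1234][56] duration [12.3s], collections [1]/[12.9s]'

# Count matching GC report lines (replace the echoed sample with your log file).
printf '%s\n' "$sample" | grep -c 'JvmGcMonitorService'
```

WARN-level lines with long durations, or many of them in a short window, are the signal to look for.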

Do we just need to add these lines to the elasticsearch.yml file?

Could you please help us with the question above?

You really need to upgrade. That will probably solve most, if not all, of the issues you are seeing.
I'm not sure we can help more than that.

Otherwise, share the full Elasticsearch logs. Maybe we can find something useful in them.
You can share them on gist.github.com if they are too big for this forum.

We are in the process of migrating Elasticsearch to AWS. While migrating the ELK data from on-prem to the AWS cloud, we are getting disconnection errors, and because of that we are unable to migrate the data. Could you please suggest the best option for migrating the data to AWS? Please share documentation if any is available.

Are you going to run Elasticsearch by yourself on EC2 instances and manage that?

I'd highly recommend looking at Cloud by Elastic, also available if needed from AWS Marketplace.

Cloud by Elastic is one way to have access to all features, all managed by us. Think about everything that is already there, like Security, Monitoring, Reporting, SQL, Canvas, Maps UI, and Alerting, plus the built-in solutions named Observability, Security, and Enterprise Search, and what is coming next :slight_smile: ...

I'd suggest reading this guide: Migrating your Elasticsearch data | Elasticsearch Service Documentation | Elastic
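If you end up moving the data yourself rather than using the hosted migration path, one option that exists in 5.x is reindex-from-remote, where the destination cluster pulls documents from the source over HTTP. A rough sketch of the request (the hostname is a placeholder, the index name is taken from your logs as an example, and the remote host must first be listed in `reindex.remote.whitelist` in the destination cluster's elasticsearch.yml):

```
POST _reindex
{
  "source": {
    "remote": { "host": "http://onprem-host:9200" },
    "index": "en_m_54"
  },
  "dest": { "index": "en_m_54" }
}
```

You would run one such request per index, after creating the destination indices with the mappings you want.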

Could you please help me with this? Do we need to add the configuration below to the elasticsearch.yml file?

```
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 20
```

Could you answer?

And also to this new question.

How?