Nodes being dropped from cluster

I am currently trying to set up a 17 node cluster.

3 master nodes - 4 vCPU, 12 GB RAM, 7 GB Heap
10 data nodes - 15 vCPU, 116 GB RAM, 32 GB Heap
3 ingest nodes - 7 vCPU, 8 GB RAM, 2 GB Heap (Is this enough?)
1 coordinator - 15 vCPU, 56 GB RAM, 32 GB Heap

This is running on Kubernetes via ECK.

I am ingesting about 2.5 TB a day into two daily indices, totaling around 4.5B documents.
One index has 60 shards and the other has 15.

I'm only running test loads right now, but one or more of my nodes are constantly dropping, seemingly during the cluster state publishing process.

I am not sure if it's an inter-pod connectivity issue.

Here are my logs:

The role of the pod outputting the log line is in the second column. This log shows the node boba-es-data-1 being removed from all nodes but it's not limited to just the data nodes.

Thanks


You only shared a tiny fraction of logs, but they contain this:

2021-02-03T17:09:18.626Z,elasticsearch,boba-es-master-1,"node-left[{boba-es-data-1}{ymsWmXsvRgyw2DsExL8hfw}{f_d7tzEIRtqbeROpC5tbHA}{172.20.120.3}{172.20.120.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--a0fccf4d-3q3m, xpack.installed=true, transform.node=false} reason: disconnected], term: 3, version: 8050, delta: removed {{boba-es-data-1}{ymsWmXsvRgyw2DsExL8hfw}{f_d7tzEIRtqbeROpC5tbHA}{172.20.120.3}{172.20.120.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--a0fccf4d-3q3m, xpack.installed=true, transform.node=false}}"

In particular, reason: disconnected indicates that a connection between nodes was closed by an external force. We'd need to see logs covering a longer timescale to be sure.

I'd also recommend against such verbose logging: the verbosity can itself cause issues, and the information needed is logged at INFO anyway.

Thanks for the response, @DavidTurner.

Here's roughly a 2 hour span of logs without trace.

There are a lot of these reason: disconnected messages. I'm wondering if it's something within my cluster that's throttling them.

Is it normal for this to be happening at the same minute every hour?

date,Host,Service,message
2021-02-04T00:11:52.393Z,gke-cx-production-boba-elasticsearch--35ca4ce8-mtm4.c.cxa-prod.internal,elasticsearch,"node-left[{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false} reason: disconnected], term: 3, version: 11089, delta: removed {{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false}}"
2021-02-03T23:11:49.567Z,gke-cx-production-boba-elasticsearch--35ca4ce8-mtm4.c.cxa-prod.internal,elasticsearch,"node-left[{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false} reason: disconnected], term: 3, version: 10538, delta: removed {{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false}}"
2021-02-03T22:11:46.769Z,gke-cx-production-boba-elasticsearch--35ca4ce8-mtm4.c.cxa-prod.internal,elasticsearch,"node-left[{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false} reason: disconnected], term: 3, version: 10176, delta: removed {{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false}}"
2021-02-03T21:11:43.878Z,gke-cx-production-boba-elasticsearch--35ca4ce8-mtm4.c.cxa-prod.internal,elasticsearch,"node-left[{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false} reason: disconnected], term: 3, version: 9748, delta: removed {{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false}}"
2021-02-03T21:11:43.877Z,gke-cx-production-boba-elasticsearch--35ca4ce8-mtm4.c.cxa-prod.internal,elasticsearch,"will process [node-left[{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false} reason: disconnected]]"
2021-02-03T20:11:40.911Z,gke-cx-production-boba-elasticsearch--35ca4ce8-mtm4.c.cxa-prod.internal,elasticsearch,"node-left[{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false} reason: disconnected], term: 3, version: 9372, delta: removed {{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false}}"
2021-02-03T20:11:40.910Z,gke-cx-production-boba-elasticsearch--35ca4ce8-mtm4.c.cxa-prod.internal,elasticsearch,"will process [node-left[{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false} reason: disconnected]]"
2021-02-03T19:11:37.735Z,gke-cx-production-boba-elasticsearch--35ca4ce8-mtm4.c.cxa-prod.internal,elasticsearch,"node-left[{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false} reason: disconnected], term: 3, version: 8931, delta: removed {{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false}}"
2021-02-03T19:11:37.735Z,gke-cx-production-boba-elasticsearch--35ca4ce8-mtm4.c.cxa-prod.internal,elasticsearch,"will process [node-left[{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false} reason: disconnected]]"
2021-02-03T18:11:34.546Z,gke-cx-production-boba-elasticsearch--35ca4ce8-mtm4.c.cxa-prod.internal,elasticsearch,"node-left[{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false} reason: disconnected], term: 3, version: 8566, delta: removed {{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false}}"
2021-02-03T18:11:34.546Z,gke-cx-production-boba-elasticsearch--35ca4ce8-mtm4.c.cxa-prod.internal,elasticsearch,"will process [node-left[{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false} reason: disconnected]]"
2021-02-03T17:11:31.774Z,gke-cx-production-boba-elasticsearch--35ca4ce8-mtm4.c.cxa-prod.internal,elasticsearch,"node-left[{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false} reason: disconnected], term: 3, version: 8093, delta: removed {{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false}}"
2021-02-03T17:11:31.774Z,gke-cx-production-boba-elasticsearch--35ca4ce8-mtm4.c.cxa-prod.internal,elasticsearch,"will process [node-left[{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false} reason: disconnected]]"
2021-02-03T16:11:28.643Z,gke-cx-production-boba-elasticsearch--35ca4ce8-mtm4.c.cxa-prod.internal,elasticsearch,"node-left[{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false} reason: disconnected], term: 3, version: 7737, delta: removed {{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false}}"
2021-02-03T16:11:28.642Z,gke-cx-production-boba-elasticsearch--35ca4ce8-mtm4.c.cxa-prod.internal,elasticsearch,"will process [node-left[{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false} reason: disconnected]]"
2021-02-03T15:11:25.135Z,gke-cx-production-boba-elasticsearch--35ca4ce8-mtm4.c.cxa-prod.internal,elasticsearch,"node-left[{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false} reason: disconnected], term: 3, version: 7326, delta: removed {{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false}}"
2021-02-03T15:11:25.135Z,gke-cx-production-boba-elasticsearch--35ca4ce8-mtm4.c.cxa-prod.internal,elasticsearch,"will process [node-left[{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false} reason: disconnected]]"
2021-02-03T14:11:21.969Z,gke-cx-production-boba-elasticsearch--35ca4ce8-mtm4.c.cxa-prod.internal,elasticsearch,"node-left[{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false} reason: disconnected], term: 3, version: 6890, delta: removed {{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false}}"
2021-02-03T14:11:21.968Z,gke-cx-production-boba-elasticsearch--35ca4ce8-mtm4.c.cxa-prod.internal,elasticsearch,"will process [node-left[{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false} reason: disconnected]]"
2021-02-03T08:11:03.750Z,gke-cx-production-boba-elasticsearch--35ca4ce8-mtm4.c.cxa-prod.internal,elasticsearch,"node-left[{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false} reason: disconnected], term: 3, version: 4452, delta: removed {{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false}}"
2021-02-03T08:11:03.750Z,gke-cx-production-boba-elasticsearch--35ca4ce8-mtm4.c.cxa-prod.internal,elasticsearch,"will process [node-left[{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false} reason: disconnected]]"
2021-02-03T07:11:02.453Z,gke-cx-production-boba-elasticsearch--35ca4ce8-mtm4.c.cxa-prod.internal,elasticsearch,"node-left[{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false} reason: disconnected], term: 3, version: 4080, delta: removed {{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false}}"
2021-02-03T07:11:02.452Z,gke-cx-production-boba-elasticsearch--35ca4ce8-mtm4.c.cxa-prod.internal,elasticsearch,"will process [node-left[{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false} reason: disconnected]]"
2021-02-03T06:11:01.200Z,gke-cx-production-boba-elasticsearch--35ca4ce8-mtm4.c.cxa-prod.internal,elasticsearch,"node-left[{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false} reason: disconnected], term: 3, version: 3681, delta: removed {{boba-es-data-0}{R_hzHKf0T5OqmCUwY59r1w}{rV6CW0BoQeKQCKdsqt2QPA}{172.20.121.3}{172.20.121.3:9300}{d}{k8s_node_name=gke-cx-production-boba-elasticsearch--fa6c311f-pmmk, xpack.installed=true, transform.node=false}}"

No, Elasticsearch doesn't drop TCP connections like that; the cause is external to the cluster.

This suggests something on your network has a 1-hour timeout. Quoting the docs (emphasis mine):

A transport connection between two nodes is made up of a number of long-lived TCP connections, some of which may be idle for an extended period of time. Nonetheless, Elasticsearch requires these connections to remain open, and it can disrupt the operation of your cluster if any inter-node connections are closed by an external influence such as a firewall. It is important to configure your network to preserve long-lived idle connections between Elasticsearch nodes, for instance by leaving tcp.keep_alive enabled and ensuring that the keepalive interval is shorter than any timeout that might cause idle connections to be closed, or by setting transport.ping_schedule if keepalives cannot be configured. Devices which drop connections when they reach a certain age are a common source of problems to Elasticsearch clusters, and must not be used.
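
The hourly pattern can be checked directly from the timestamps in the logs posted above. Here is a minimal Python sketch that takes five of the node-left timestamps from that excerpt and computes the gap between consecutive events. The function name gaps_in_seconds is made up for illustration; only the timestamps come from the logs.

```python
from datetime import datetime

# Five consecutive "node-left ... reason: disconnected" timestamps,
# copied from the log excerpt earlier in this thread.
NODE_LEFT_TIMES = [
    "2021-02-03T14:11:21.969Z",
    "2021-02-03T15:11:25.135Z",
    "2021-02-03T16:11:28.643Z",
    "2021-02-03T17:11:31.774Z",
    "2021-02-03T18:11:34.546Z",
]


def gaps_in_seconds(timestamps):
    """Return the gaps, in seconds, between consecutive events (hypothetical helper)."""
    parsed = sorted(
        datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ") for ts in timestamps
    )
    return [
        (later - earlier).total_seconds()
        for earlier, later in zip(parsed, parsed[1:])
    ]


gaps = gaps_in_seconds(NODE_LEFT_TIMES)
for gap in gaps:
    # Each gap is one hour plus a few seconds.
    print(f"{gap:.3f} s")
```

Every gap comes out a few seconds over 3600, which is what you would expect if a connection is cut at a fixed one-hour mark and then takes a moment to be re-established.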

Thanks @DavidTurner. I found out that our Istio mesh is set to time out long-running connections after an hour. Keep-alives don't seem to be effective, but after adding the ping_schedule setting, the disconnects stopped.

Thanks

I wouldn't recommend running Elasticsearch in a service mesh like Istio. There's no need: Elasticsearch handles its own clustering and intra-cluster routing, so a mesh will only cause you problems like this. ping_schedule is much weaker than TCP keepalives.
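
For readers unfamiliar with TCP keepalives, here is a rough Python sketch of what they look like at the socket level. It is an illustration only, not how Elasticsearch configures its own connections. The function names and the idle/interval/probe values are assumptions chosen for the example, and the TCP_KEEP* options are platform-specific, so they are guarded with hasattr.

```python
import socket


def enable_keepalive(sock, idle_s=300, interval_s=60, probes=5):
    """Turn on OS-level TCP keepalive probes for a socket (illustrative values)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The options below exist on Linux; other platforms may lack them.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between probes once probing has started.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes before the connection is considered dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


def keepalive_beats_idle_timeout(idle_s, device_idle_timeout_s):
    """True if the first probe fires before a device's idle timeout would expire."""
    return idle_s < device_idle_timeout_s


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()

# A 300 s keepalive idle time is shorter than a 3600 s idle timeout.
print(keepalive_on, keepalive_beats_idle_timeout(300, 3600))
```

Keepalive probes only help against idle timeouts, and only when they travel the whole path between the two nodes. A proxy that terminates TCP in the middle, or a device that closes connections purely by age, is a different situation.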

Thanks, I'll keep that in mind.