Nodes randomly, temporarily, leaving 7.3.2 cluster

I'm really stumped here so I'm hoping that someone can point me in the right direction. Thanks!

Over the past couple weeks I've noticed that nodes will occasionally leave our cluster and then rejoin a short time later and I don't understand why. And by "occasionally" I really mean "occasionally", like it didn't happen for the past week and then just today 2 data nodes left, the cluster recovered, then 4 data nodes left, the cluster recovered, and then one of the master eligible nodes left. The traffic today didn't seem abnormal.

Cluster stats

  • The cluster runs on AWS
  • ES version 7.3.2
  • 3 dedicated masters (r4.2xlarge)
  • 3 dedicated coordinating (r4.2xlarge)
  • 5 dedicated ingest (r4.2xlarge)
  • 20 dedicated data nodes (i3.2xlarge)
  • 7,166 total shards
  • about 6.5 TB of storage
  • 30GB of heap space

The configuration on one of the data nodes that left (every other configuration is pretty much the same except for the different node roles):

  discovery:
    seed_providers: ec2
    ec2.tag.elastic_role: master
    ec2.tag.cluster_name: elasticsearch.platform.rate.com
  cluster:
    initial_master_nodes: ""
    name: elasticsearch.platform.rate.com
  node:
    master: false
    data: true
    ingest: false
    ml: false
    name: ""
    attr.box_type: warm
  network.host: 0.0.0.0
  xpack:
    ml.enabled: false
    monitoring:
      enabled: true
      elasticsearch.collection.enabled: true
      collection.enabled: true

The log which I'm seeing on the data node before it leaves:

{"type": "server", "timestamp": "2020-03-24T21:52:22,579+0000", "level": "INFO", "component": "o.e.c.c.Coordinator", "cluster.name": "elasticsearch.platform.rate.com", "node.name": "ip-10-101-20-143", "cl
uster.uuid": "0mOjFZ1wTy6OMWkB6mKyNw", "node.id": "z65w-XMGRcac9SrR57KyWw",  "message": "master node [{master-1}{uUNpXOzCRt6IZAE8VWAdeA}{W5hOAQ5qR9G6tAqxep7gfg}{10.101.23.77}{10.101.23.77:9300}{m}{xpack.i
nstalled=true, box_type=master}] failed, restarting discovery" , 
"stacktrace": ["org.elasticsearch.ElasticsearchException: node [{master-1}{uUNpXOzCRt6IZAE8VWAdeA}{W5hOAQ5qR9G6tAqxep7gfg}{10.101.23.77}{10.101.23.77:9300}{m}{xpack.installed=true, box_type=master}] faile
d [3] consecutive checks",
"at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:278) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1111) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1111) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:246) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.InboundHandler.handleException(InboundHandler.java:244) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:236) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:139) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:105) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:660) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]",
"at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) ~[?:?]",
"at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]",
"at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]",
"at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]",
"at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) ~[?:?]",
"at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) ~[?:?]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:682) ~[?:?]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:582) ~[?:?]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:536) ~[?:?]",
"at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) ~[?:?]",
"at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) ~[?:?]",
"at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]",
"at java.lang.Thread.run(Thread.java:835) [?:?]",
"Caused by: org.elasticsearch.transport.RemoteTransportException: [master-1][10.101.23.77:9300][internal:coordination/fault_detection/leader_check]",
"Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: leader check from unknown node",

This seems to be the reason the node left:

failed, restarting discovery",
"stacktrace": ["org.elasticsearch.ElasticsearchException: node [{master-1}{uUNpXOzCRt6IZAE8VWAdeA}{W5hOAQ5qR9G6tAqxep7gfg}{10.101.23.77}{10.101.23.77:9300}{m}{xpack.installed=true, box_type=master}]
failed [3] consecutive checks",

But I'm stumped as to why the master failed these checks. It doesn't seem overworked if I look at the CPU, JVM usage, GC invocations, etc...

I don't see anything that seems noteworthy on the logs for the elected master with regards to nodes leaving. I do see a fair number of errors with indexing the monitoring information. Perhaps I should turn that setting off?

Thanks again for any and all help!

One thing we're trying is upgrading the version of docker which elasticsearch runs on. We are currently on version 18.09.3 and in the release notes for version 18.09.9 (https://docs.docker.com/engine/release-notes/18.09/) it says:

Fix service port for an application becomes unavailable randomly

That sounds very fishy. I'll upgrade docker and post back if it seems to resolve the issue.

Unfortunately that did not resolve our issue. For starters we're going to upgrade to the latest version of elasticsearch just to get that out of the way. One other thing we'd like to do is turn off dynamic mappings which is currently enabled on some of our indices.

This means the checks failed because the master already removed this node from the cluster. You'll need to look at the master logs to find out why.

Ohhhh interesting. I didn't know that. Thank you, I really appreciate the insight. Are there specific logging settings I could turn up to better debug this?

On the elected master, the only logs I see that seem related to cluster management are the node-left logs (and subsequent removed/added and node-join logs from the ClusterApplierService and MasterService components respectively) but they don't seem telling, to me at least, of why the node left:

{"type": "server", "timestamp": "2020-03-28T00:47:20,733+0000",
"level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name":
"elasticsearch.platform.rate.com", "node.name": "master-2",
"cluster.uuid": "0mOjFZ1wTy6OMWkB6mKyNw", "node.id":
"pYUJDXUCTVS7TkA4yrTE8g", "message":
"node-left[{ip-10-101-23-195}{igbwLJ0NQX-pJTiMelLntg}{uO09FgiPSR-x0nEBtOpXqQ}{10.101.23.195}{10.101.23.195:9300}{d}{xpack.installed=true,
box_type=warm} disconnected,
{ip-10-101-23-163}{SUAjay-KSdKbUJRfxxkx2g}{QLzBo1bGSau3DkVSUfdosQ}{10.101.23.163}{10.101.23.163:9300}{d}{xpack.installed=true,
box_type=hot} disconnected], term: 8, version: 2349565, reason:
removed
{{ip-10-101-23-163}{SUAjay-KSdKbUJRfxxkx2g}{QLzBo1bGSau3DkVSUfdosQ}{10.101.23.163}{10.101.23.163:9300}{d}{xpack.installed=true,
box_type=hot},{ip-10-101-23-195}{igbwLJ0NQX-pJTiMelLntg}{uO09FgiPSR-x0nEBtOpXqQ}{10.101.23.195}{10.101.23.195:9300}{d}{xpack.installed=true,
box_type=warm},}" }

And those node-left logs confuse me because I don't see a node-left log for all nodes that did leave and because I'm seeing the node-left log after the failed, restarting discovery log. Maybe the node-left logs aren't helpful for me then? For example the above node-left log fired at 2020-03-28T00:47:20 and earlier at 2020-03-28T00:44:33 I see the failed, restarting discovery log on 10.101.23.195. It feels like there should be some other log before 2020-03-28T00:44:33 which says why the master removed the node from the cluster:

{"type": "server", "timestamp": "2020-03-28T00:44:33,007+0000", "level": "INFO", "component": "o.e.c.c.Coordinator", "cluster.name": "elasticsearch.platform.rate.com", "node.name": "ip-10-101-23-195", "cl
uster.uuid": "0mOjFZ1wTy6OMWkB6mKyNw", "node.id": "igbwLJ0NQX-pJTiMelLntg",  "message": "master node [{master-2}{pYUJDXUCTVS7TkA4yrTE8g}{XKIRqK-6TZKywmXn2Xk9Vw}{10.101.25.191}{10.101.25.191:9300}{m}{xpack
.installed=true, box_type=master}] failed, restarting discovery" , 
"stacktrace": ["org.elasticsearch.ElasticsearchException: node [{master-2}{pYUJDXUCTVS7TkA4yrTE8g}{XKIRqK-6TZKywmXn2Xk9Vw}{10.101.25.191}{10.101.25.191:9300}{m}{xpack.installed=true, box_type=master}] fai
led [3] consecutive checks",
"at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:278) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1111) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1111) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:246) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.InboundHandler.handleException(InboundHandler.java:244) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:236) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:139) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:105) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:660) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]",
"at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) ~[?:?]",
"at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]",
"at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]",
"at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]",
"at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) ~[?:?]",
"at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) ~[?:?]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:682) ~[?:?]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:582) ~[?:?]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:536) ~[?:?]",
"at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) ~[?:?]",
"at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) ~[?:?]",
"at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]",
"at java.lang.Thread.run(Thread.java:835) [?:?]",
"Caused by: org.elasticsearch.transport.RemoteTransportException: [master-2][10.101.25.191:9300][internal:coordination/fault_detection/leader_check]",
"Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: leader check from unknown node",

The key word in the node-left log line is disconnected, meaning this node was removed from the cluster because a TCP connection closed. Elasticsearch doesn't close these connections while it's running so either the node restarted or else something other than Elasticsearch closed the connection. Unfortunately Elasticsearch doesn't really know much more about it than that, so there's not really any extra logging available.

Make sure you're respecting the instructions about long-lived idle connections since that's a common culprit.

Also, regarding the timestamps, check that the clocks on your nodes are all properly synchronized. I think the node-left line comes first and that the timestamps are wrong.

Thanks a lot for that tip! I'll focus my efforts on trying to find some some underlying process/bug/whatever that is closing tcp connections (as well as look at the ES configurations around idle connections you linked).

And the clocks appear to be synchronized between the nodes.

Thanks again, I'll report back if I discover anything.

Also here's a couple other details about the OS and how ES is being run in case anyone has encountered this issue of TCP connections being closed:

ES is running as a docker container which gets started by systemd and here is the version of debian and the kernel:

$ cat /etc/os-release 
PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
NAME="Debian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"
VERSION_CODENAME=stretch
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"

$ cat /etc/debian_version 
9.11

$ uname -r
4.9.0-11-amd64

So I'm still seeing the issue of nodes dropping but I noticed a different reason in the elected master logs as to why the nodes were removed from the cluster:

node-left[{ip-10-101-21-150}{nWhvcQrPS6CjG8kHYk8Fjg}{sxqw6y32Qq2eW8S0NBdsiA}
{10.101.21.150}{10.101.21.150:9300}{d}{xpack.installed=true,
box_type=warm} followers check retry count exceeded,

And another related log line:

followers check retry count exceeded]] took [30.9s] which is above the
warn threshold of 30s" }

What can cause the follower check to fail like this? To add some context, this happened around when the UTC date changed at which point a bunch of new daily indices start getting written. I wonder if the nodes became less responsive so the follower check failed?

I also still see those disconnected logs happening. Still debugging that.

This means the node stopped responding to lightweight health checks for over 30 seconds, which is normally either network issues (e.g. packet loss) or else the node spent a very long time doing GC.

That's interesting... GC doesn't seem all that active, at least not abnormally so, (one of the nodes that got removed from the cluster had microsecond levels of GC runtime per second) which leaves us with networking issues. But it's also strange to me that it would be networking issues in the sense that the nodes communicate within a VPC on AWS so from node to node it's just a single hop and I can't imagine a connection like that failing.

I'll keep trying things and post back here. Thanks again.

I think there's quite a bit of fairly complex infrastructure in between nodes in an AWS VPC, although it's well-hidden from the nodes themselves. It's certainly not a "single hop" in reality.

I've definitely seen packet loss on this kind of a connection, but not normally a connection reset that would lead to a disconnected message; at least I've only seen that when keepalives were repeatedly not being delivered.

Good to keep in mind, thank you. The other thing that strikes me as odd here then is that we also have a dev version of the cluster (same exact AMI, terraform, and all that but just with less nodes in the cluster) deployed in a different AWS account and it has never experienced these issues as far as I'm aware... Then again maybe we just haven't been paying close enough attention to it.

I'll look into how to see whether keepalives are getting delivered or not.

It almost makes me wonder if the load on the one cluster happens to exploit a configuration of tcp keep alives that is less than optimal (where as the cluster with less traffic is fine). The default keep alive settings for Debian buster are what's configured:

$ sudo sysctl --all | grep keepalive
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200

Also, out of curiosity, why did you introduce the leader+follower check logic (https://github.com/elastic/elasticsearch/pull/33024, https://github.com/elastic/elasticsearch/pull/33917) when it seems like equivalent functionality to do so already existed in v6 of elasticsearch: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/modules-discovery-zen.html#fault-detection ?

Fault detection is closely linked to cluster coordination, and the whole cluster coordination module was replaced in 7.0 (https://github.com/elastic/elasticsearch/issues/32006, https://www.elastic.co/blog/a-new-era-for-cluster-coordination-in-elasticsearch). The actual behaviour of these fault detectors is not materially different between v6 and v7.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.