Nodes randomly and temporarily leaving 7.3.2 cluster

I'm really stumped here so I'm hoping that someone can point me in the right direction. Thanks!

Over the past couple of weeks I've noticed that nodes will occasionally leave our cluster and then rejoin a short time later, and I don't understand why. And by "occasionally" I really do mean occasionally: it didn't happen at all for the past week, and then just today 2 data nodes left, the cluster recovered, then 4 data nodes left, the cluster recovered, and then one of the master-eligible nodes left. The traffic today didn't seem abnormal.

Cluster stats

  • The cluster runs on AWS
  • ES version 7.3.2
  • 3 dedicated masters (r4.2xlarge)
  • 3 dedicated coordinating (r4.2xlarge)
  • 5 dedicated ingest (r4.2xlarge)
  • 20 dedicated data nodes (i3.2xlarge)
  • 7,166 total shards
  • about 6.5 TB of storage
  • 30 GB of heap space

The configuration on one of the data nodes that left (every other node's configuration is essentially the same, apart from the node roles):

  discovery:
    seed_providers: ec2
    ec2.tag.elastic_role: master
    ec2.tag.cluster_name: elasticsearch.platform.rate.com
  cluster:
    initial_master_nodes: ""
    name: elasticsearch.platform.rate.com
  node:
    master: false
    data: true
    ingest: false
    ml: false
    name: ""
    attr.box_type: warm
  network.host: 0.0.0.0
  xpack:
    ml.enabled: false
    monitoring:
      enabled: true
      elasticsearch.collection.enabled: true
      collection.enabled: true

The log I'm seeing on the data node before it leaves:

{"type": "server", "timestamp": "2020-03-24T21:52:22,579+0000", "level": "INFO", "component": "o.e.c.c.Coordinator", "cluster.name": "elasticsearch.platform.rate.com", "node.name": "ip-10-101-20-143", "cl
uster.uuid": "0mOjFZ1wTy6OMWkB6mKyNw", "node.id": "z65w-XMGRcac9SrR57KyWw",  "message": "master node [{master-1}{uUNpXOzCRt6IZAE8VWAdeA}{W5hOAQ5qR9G6tAqxep7gfg}{10.101.23.77}{10.101.23.77:9300}{m}{xpack.i
nstalled=true, box_type=master}] failed, restarting discovery" , 
"stacktrace": ["org.elasticsearch.ElasticsearchException: node [{master-1}{uUNpXOzCRt6IZAE8VWAdeA}{W5hOAQ5qR9G6tAqxep7gfg}{10.101.23.77}{10.101.23.77:9300}{m}{xpack.installed=true, box_type=master}] faile
d [3] consecutive checks",
"at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:278) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1111) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1111) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:246) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.InboundHandler.handleException(InboundHandler.java:244) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:236) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:139) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:105) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:660) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]",
"at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) ~[?:?]",
"at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]",
"at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]",
"at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]",
"at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) ~[?:?]",
"at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) ~[?:?]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:682) ~[?:?]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:582) ~[?:?]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:536) ~[?:?]",
"at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) ~[?:?]",
"at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) ~[?:?]",
"at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]",
"at java.lang.Thread.run(Thread.java:835) [?:?]",
"Caused by: org.elasticsearch.transport.RemoteTransportException: [master-1][10.101.23.77:9300][internal:coordination/fault_detection/leader_check]",
"Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: leader check from unknown node",

This seems to be the reason the node left:

failed, restarting discovery",
"stacktrace": ["org.elasticsearch.ElasticsearchException: node [{master-1}{uUNpXOzCRt6IZAE8VWAdeA}{W5hOAQ5qR9G6tAqxep7gfg}{10.101.23.77}{10.101.23.77:9300}{m}{xpack.installed=true, box_type=master}]
failed [3] consecutive checks",

But I'm stumped as to why the master failed these checks. It doesn't seem overworked when I look at CPU, JVM usage, GC invocations, and so on.

I don't see anything noteworthy in the elected master's logs with regard to nodes leaving. I do see a fair number of errors indexing the monitoring information. Perhaps I should turn that setting off?
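
If I do turn it off, my understanding is that collection can be disabled cluster-wide without a restart via the cluster settings API; a minimal sketch, assuming the usual localhost:9200 endpoint:

# dynamically disable all monitoring collection (xpack.monitoring.collection.enabled is a dynamic setting in 7.x)
$ curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "xpack.monitoring.collection.enabled": false
  }
}'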

Thanks again for any and all help!

One thing we're trying is upgrading the version of Docker that Elasticsearch runs on. We are currently on version 18.09.3, and the release notes for version 18.09.9 (https://docs.docker.com/engine/release-notes/18.09/) include:

Fix service port for an application becomes unavailable randomly

That sounds very fishy. I'll upgrade Docker and post back if it seems to resolve the issue.

Unfortunately that did not resolve our issue. For starters we're going to upgrade to the latest version of Elasticsearch, just to rule that out. One other thing we'd like to do is turn off dynamic mapping, which is currently enabled on some of our indices.
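
For the dynamic mapping change, the plan is roughly the following (my-index is a placeholder for our real index names; "strict" rejects documents containing unmapped fields, while false would silently ignore them):

# turn off dynamic mapping on an existing index (index name is a placeholder)
$ curl -X PUT "localhost:9200/my-index/_mapping" -H 'Content-Type: application/json' -d'
{
  "dynamic": "strict"
}'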

This means the checks failed because the master already removed this node from the cluster. You'll need to look at the master logs to find out why.

Ohhhh interesting. I didn't know that. Thank you, I really appreciate the insight. Are there specific logging settings I could turn up to better debug this?
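
In the meantime, I believe logger levels can be raised dynamically through the cluster settings API, so as a guess I may try turning up the coordination package, something like:

# bump coordination logging to DEBUG cluster-wide (revert later by setting it to null)
$ curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "logger.org.elasticsearch.cluster.coordination": "DEBUG"
  }
}'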

On the elected master, the only logs I see that seem related to cluster management are the node-left logs (and the subsequent removed/added and node-join logs from the ClusterApplierService and MasterService components, respectively), but, to me at least, they don't reveal why the node left:

{"type": "server", "timestamp": "2020-03-28T00:47:20,733+0000",
"level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name":
"elasticsearch.platform.rate.com", "node.name": "master-2",
"cluster.uuid": "0mOjFZ1wTy6OMWkB6mKyNw", "node.id":
"pYUJDXUCTVS7TkA4yrTE8g", "message":
"node-left[{ip-10-101-23-195}{igbwLJ0NQX-pJTiMelLntg}{uO09FgiPSR-x0nEBtOpXqQ}{10.101.23.195}{10.101.23.195:9300}{d}{xpack.installed=true,
box_type=warm} disconnected,
{ip-10-101-23-163}{SUAjay-KSdKbUJRfxxkx2g}{QLzBo1bGSau3DkVSUfdosQ}{10.101.23.163}{10.101.23.163:9300}{d}{xpack.installed=true,
box_type=hot} disconnected], term: 8, version: 2349565, reason:
removed
{{ip-10-101-23-163}{SUAjay-KSdKbUJRfxxkx2g}{QLzBo1bGSau3DkVSUfdosQ}{10.101.23.163}{10.101.23.163:9300}{d}{xpack.installed=true,
box_type=hot},{ip-10-101-23-195}{igbwLJ0NQX-pJTiMelLntg}{uO09FgiPSR-x0nEBtOpXqQ}{10.101.23.195}{10.101.23.195:9300}{d}{xpack.installed=true,
box_type=warm},}" }

And those node-left logs confuse me, both because I don't see a node-left log for every node that left and because the node-left log appears after the failed, restarting discovery log. Maybe the node-left logs aren't helpful for me then? For example, the node-left log above fired at 2020-03-28T00:47:20, but earlier, at 2020-03-28T00:44:33, I see the failed, restarting discovery log on 10.101.23.195. It feels like there should be some other log before 2020-03-28T00:44:33 that says why the master removed the node from the cluster:

{"type": "server", "timestamp": "2020-03-28T00:44:33,007+0000", "level": "INFO", "component": "o.e.c.c.Coordinator", "cluster.name": "elasticsearch.platform.rate.com", "node.name": "ip-10-101-23-195", "cl
uster.uuid": "0mOjFZ1wTy6OMWkB6mKyNw", "node.id": "igbwLJ0NQX-pJTiMelLntg",  "message": "master node [{master-2}{pYUJDXUCTVS7TkA4yrTE8g}{XKIRqK-6TZKywmXn2Xk9Vw}{10.101.25.191}{10.101.25.191:9300}{m}{xpack
.installed=true, box_type=master}] failed, restarting discovery" , 
"stacktrace": ["org.elasticsearch.ElasticsearchException: node [{master-2}{pYUJDXUCTVS7TkA4yrTE8g}{XKIRqK-6TZKywmXn2Xk9Vw}{10.101.25.191}{10.101.25.191:9300}{m}{xpack.installed=true, box_type=master}] fai
led [3] consecutive checks",
"at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:278) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1111) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1111) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:246) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.InboundHandler.handleException(InboundHandler.java:244) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:236) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:139) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:105) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:660) ~[elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]",
"at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) ~[?:?]",
"at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]",
"at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) ~[?:?]",
"at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) ~[?:?]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) ~[?:?]",
"at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) ~[?:?]",
"at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) ~[?:?]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:682) ~[?:?]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:582) ~[?:?]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:536) ~[?:?]",
"at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) ~[?:?]",
"at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) ~[?:?]",
"at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]",
"at java.lang.Thread.run(Thread.java:835) [?:?]",
"Caused by: org.elasticsearch.transport.RemoteTransportException: [master-2][10.101.25.191:9300][internal:coordination/fault_detection/leader_check]",
"Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: leader check from unknown node",

The key word in the node-left log line is disconnected, meaning this node was removed from the cluster because a TCP connection closed. Elasticsearch doesn't close these connections while it's running, so either the node restarted or something other than Elasticsearch closed the connection. Unfortunately Elasticsearch doesn't know much more about it than that, so there isn't really any extra logging available.
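
If you want to observe it from the OS side, one option (generic tooling, nothing Elasticsearch-specific) is to watch the transport connections on an affected node and the kernel's reset counters:

# established connections involving the transport port 9300
$ ss -tn state established '( sport = :9300 or dport = :9300 )'

# TCP resets seen by the kernel; a climbing count hints that something is tearing connections down
$ netstat -s | grep -i reset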

Make sure you're respecting the instructions about long-lived idle connections since that's a common culprit.
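
For example, one common mitigation from those instructions is to have the OS send TCP keepalives well before any firewall or NAT idle timeout closes the connection (values below are illustrative, to be tuned to your environment; run as root):

# first probe after 5 minutes idle, then every 60 seconds, drop after 3 missed probes
$ sysctl -w net.ipv4.tcp_keepalive_time=300
$ sysctl -w net.ipv4.tcp_keepalive_intvl=60
$ sysctl -w net.ipv4.tcp_keepalive_probes=3

Since you're running Elasticsearch in Docker, note that these may need to be applied inside the container's network namespace (e.g. via docker run --sysctl) rather than only on the host.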

Also, regarding the timestamps, check that the clocks on your nodes are all properly synchronized. I think the node-left line comes first and that the timestamps are wrong.
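
A quick way to sanity-check that on each host, assuming the clocks are managed by systemd/NTP:

# "synchronized: yes" means the NTP client thinks the clock is in sync; comparing `date -u` across nodes works too
$ timedatectl | grep -i synchronized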

Thanks a lot for that tip! I'll focus my efforts on finding whatever underlying process/bug is closing TCP connections (and I'll also look at the ES configuration around idle connections that you linked).

And the clocks appear to be synchronized between the nodes.

Thanks again, I'll report back if I discover anything.

Also, here are a couple of other details about the OS and how ES is being run, in case anyone has encountered this issue of TCP connections being closed:

ES runs as a Docker container started by systemd. Here are the Debian and kernel versions:

$ cat /etc/os-release 
PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
NAME="Debian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"
VERSION_CODENAME=stretch
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"

$ cat /etc/debian_version 
9.11

$ uname -r
4.9.0-11-amd64