After almost a week of perfect stability, suddenly the jitter has returned with a corresponding spike in load on wh381, one of the new servers and the one that always seems to be acting up the most. Grabbed hot-threads output.
For one of the Elasticsearch nodes running on the machine it's all transport_client_boss threads, and on the other it's transport_server_worker.default.
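For reference, this kind of output comes from the nodes hot threads API. Here is a minimal sketch of pulling it over HTTP in Python, assuming a node reachable on localhost:9200; the host and the commented-out credentials are placeholders and will differ if security is enabled.

import requests

# Nodes hot threads API; parameters mirror the header in the dump further down
# (interval=500ms, busiestThreads=3, ignoreIdleThreads=true).
resp = requests.get(
    "http://localhost:9200/_nodes/hot_threads",
    params={"threads": 3, "interval": "500ms", "ignore_idle_threads": "true"},
    # auth=("elastic", "<password>"),  # placeholder; only needed if security is enabled
)
resp.raise_for_status()
print(resp.text)  # plain-text report, one section per node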
So, yeah. We're back to square one. I now believe the JDK upgrade didn't actually fix anything; the apparent fix was really just restarting the Elasticsearch processes, and once they had been up for ~6 days again, the same problems reappeared. Indexing across all of the hot servers was again terrible this morning, and the very minute we shut down the Elasticsearch processes on wh381 everything returned to being essentially perfect.
So, we run two Elasticsearch processes on each server. The engineer that took down Elasticsearch on wh381 accidentally only took down one of the two. So, we went from the entire cluster unable to index properly to everything being perfect within about a minute of just one of the two Elasticsearch processes on this server going down.
Can you share the hot threads output, and identify which nodes are which in it? Specifically, which are the two pairs of nodes running on the new hosts and which one was the one that was shut down?
Sure. It's way too big to post in a comment, assuming you'd want to see hot threads output for more than just the problem server, since this impacts indexing on every node. Can you provide another upload link?
Done and done. The first part of the file is a hot-threads dump from when we were looking bad. You'll notice there is both an elk-elasticsearchwh381-eqx-1 and an elk-elasticsearchwh381-eqx-2.
Then there are some blank lines, and I separated the next output with ---. What follows is a hot-threads dump from just a few minutes ago where everything looks great. Only elk-elasticsearchwh381-eqx-2 is present from that server in that one.
I'm tempted to not care anymore. Everything is so perfect right now, including a few lengthy spikes to almost double traffic where indexing tracked perfectly, that I might just say "This is the weird server" and leave it running only one node. Unless problems return, that is.
The engineer in me still wants a root cause, so if anyone has ideas I'd love to hear them, but I cannot justify much more of my own time. Between myself and the other engineers who have been trying to figure this out for almost two months, the combined salary hours spent on this are starting to be pretty ridiculous.
In the "slow" hot threads dump, both of the elk-elasticsearchwh381-eqx nodes are at 100% CPU on the three listed threads in a call to the OS to write to the network for the full 500ms of sampling:
::: {elk-elasticsearchwh381-eqx-1}{QrECOldSSzOUXsK_X61hLg}{XBKzBPUtRi-C7Xtcgl7kRA}{10.122.40.190}{10.122.40.190:9300}{xpack.installed=true, box_type=hot, zone=eqx_dal, gateway=false}
Hot threads at 2019-07-17T15:16:31.343Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
101.4% (506.9ms out of 500ms) cpu usage by thread 'elasticsearch[elk-elasticsearchwh381-eqx-1][[transport_server_worker.default]][T#26]'
10/10 snapshots sharing following 34 elements
sun.nio.ch.FileDispatcherImpl.write0(Native Method)
sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
sun.nio.ch.IOUtil.write(IOUtil.java:51)
sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:388)
io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:362)
io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:901)
io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1321)
io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
io.netty.handler.ssl.SslHandler.forceFlush(SslHandler.java:1667)
io.netty.handler.ssl.SslHandler.wrapAndFlush(SslHandler.java:735)
io.netty.handler.ssl.SslHandler.flush(SslHandler.java:712)
io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
io.netty.handler.logging.LoggingHandler.flush(LoggingHandler.java:265)
io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
io.netty.channel.AbstractChannelHandlerContext.access$1500(AbstractChannelHandlerContext.java:38)
io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1129)
io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
java.lang.Thread.run(Thread.java:745)
This seems weird, and suggests something wrong with the network subsystem on this machine. Maybe bad config, maybe a broken driver or firmware? Elasticsearch opens many concurrent connections (11-13 between each pair of nodes in each direction, and there are 2 nodes on this host, so it should be dealing with 66 * 11 * 2 * 2 ≈ 3000 connections), which could well be enough to trigger some kind of issue.
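If it helps to check that estimate against reality, here is a rough sketch for counting established transport connections on the host, assuming psutil is available and the two local nodes listen on transport ports 9300/9301 (those ports are an assumption; adjust to the actual transport ports).

import psutil

# Count established TCP connections where either endpoint is a transport port.
# May need root to see sockets belonging to other users' processes.
TRANSPORT_PORTS = {9300, 9301}  # assumption: one transport port per local node

conns = [
    c for c in psutil.net_connections(kind="tcp")
    if c.status == psutil.CONN_ESTABLISHED
    and (c.laddr.port in TRANSPORT_PORTS
         or (c.raddr and c.raddr.port in TRANSPORT_PORTS))
]
print(f"established transport connections: {len(conns)}")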
To be entirely honest, my first guess 6-8 weeks ago, or however long ago this started, was "networking." However, I haven't found anything anomalous with this box, and have engaged network engineers at my company to confirm. Throughput capabilities are greater on this machine (and the other new one) and there don't appear to be any configuration differences. Additionally, we've verified there are no elevated TCP error rates at the local interfaces or at the ToR switch. MTU matches everywhere, manual throughput tests let us send much more data from one node to another than ever moves during normal operations, and there is no evidence of TCP port exhaustion. I was a network engineer in a former life, so I know some of what to check for, and I was at a loss in terms of networking issues when I started this thread, but I can try to dig deep again.
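Along those lines, here is a quick sketch of the kind of per-interface error and TCP retransmit check involved, assuming a Linux host; the interface name "eth0" is a placeholder for the actual uplink interface.

from pathlib import Path

IFACE = "eth0"  # placeholder; substitute the real uplink interface

# Per-interface error/drop counters exposed by the kernel.
stats = Path(f"/sys/class/net/{IFACE}/statistics")
for counter in ("rx_errors", "tx_errors", "rx_dropped", "tx_dropped"):
    print(counter, (stats / counter).read_text().strip())

# TCP retransmissions from the kernel's SNMP counters.
tcp_lines = [l.split() for l in Path("/proc/net/snmp").read_text().splitlines()
             if l.startswith("Tcp:")]
header, values = tcp_lines[0], tcp_lines[1]
print("RetransSegs:", values[header.index("RetransSegs")])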
That being said, I remember some kind of strangeness with the network config when I first built this box. I wish I could remember exactly what it was. I do remember it was just changing a line or two in /etc/sysconfig/network-scripts/, and it seemed so minor I didn't think twice about it. I might just re-image this machine at this point, unless someone else has any ideas.
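If it's useful before re-imaging, here is a tiny sketch for dumping the non-comment settings from every ifcfg-* file so they can be diffed against the other new host; the path assumes the RHEL/CentOS-style layout mentioned above.

from pathlib import Path

# Print key=value settings from each interface config so two hosts can be diffed.
for cfg in sorted(Path("/etc/sysconfig/network-scripts").glob("ifcfg-*")):
    print(f"== {cfg.name} ==")
    for line in cfg.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            print("  " + line)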
I've had the networking team and hardware teams examine everything they can think of, both at the hardware level and the system level, and they can find nothing wrong with the network subsystems or the network link on this server.