Indexing Becomes Inconsistent and Problematic After Adding Capacity

After almost a week of perfect stability, the jitter has suddenly returned, with a corresponding spike in load on wh381, one of the new servers and the one that always seems to be acting up the most. Grabbed hot-threads output.
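
For context, these dumps come from the nodes hot threads API with threads=3 and interval=500ms (you can see that in the headers further down). A minimal sketch of pulling an equivalent dump, assuming plain HTTP on localhost:9200 with no TLS or auth (adjust for your setup), looks like this:

    # Sketch: fetch a hot-threads dump from the local node's HTTP endpoint.
    # Assumptions: Elasticsearch HTTP reachable on localhost:9200, no TLS/auth.
    import requests

    resp = requests.get(
        "http://localhost:9200/_nodes/hot_threads",
        params={"threads": 3, "interval": "500ms", "ignore_idle_threads": "true"},
        timeout=30,
    )
    resp.raise_for_status()
    with open("hot_threads.txt", "w") as f:
        f.write(resp.text)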

For one of the Elasticsearch nodes running on the machine, the hot threads are all transport_client_boss threads; on the other, they are all transport_server_worker.default threads.

So, yeah. We're back to square one. I now believe that the JDK upgrade didn't actually fix anything; what helped was just restarting the Elasticsearch processes, and once they had been up for ~6 days again, the same problems reappeared. Indexing across all of the hot servers was again terrible this morning, and the minute we shut down the Elasticsearch processes on wh381, everything returned to being essentially perfect.

And, here is even more fun data:

So, we run two Elasticsearch processes on each server. The engineer who took down Elasticsearch on wh381 accidentally took down only one of the two. So we went from the entire cluster being unable to index properly to everything being perfect within about a minute of just one of the two Elasticsearch processes on this server going down.

Can you share the hot threads output and identify which nodes are which in it? Specifically, which are the two pairs of nodes running on the new hosts, and which one was shut down?

Sure. It's way too big to post in a comment, assuming you'd want to see hot threads output for more than just the problem server, since this impacts indexing on every node. Can you provide another upload link?

Sure, here you go: https://upload-staging.elstc.co/u/1713b783-c7b1-48e4-93fc-46a3bb0f77dc

Done and done. The first part of the file is a hot-threads dump from when we were looking bad. You'll notice there is both an elk-elasticsearchwh381-eqx-1 and an elk-elasticsearchwh381-eqx-2.

Then there are some blank lines, and I separated the next output with ---. What follows is a hot-threads dump from just a few minutes ago, when everything looks great. Only elk-elasticsearchwh381-eqx-2 is present from that server in that one.

I'm tempted to not care anymore. Everything is so perfect right now, including a few lengthy spikes to almost double traffic where indexing tracked perfectly, that I might just say "This is the weird server" and leave it running only one node. Unless problems return, that is.

The engineer in me still wants a root cause, so if anyone has ideas I'd love to hear them, but I can't justify much more of my own time. Between myself and the other engineers who have been trying to figure this out for almost two months, the combined salary hours spent on this are starting to look pretty ridiculous.

In the "slow" hot-threads dump, both of the elk-elasticsearchwh381-eqx nodes show their three busiest threads at ~100% CPU, spending the full 500ms sampling interval in a call to the OS to write to the network:

::: {elk-elasticsearchwh381-eqx-1}{QrECOldSSzOUXsK_X61hLg}{XBKzBPUtRi-C7Xtcgl7kRA}{10.122.40.190}{10.122.40.190:9300}{xpack.installed=true, box_type=hot, zone=eqx_dal, gateway=false}
   Hot threads at 2019-07-17T15:16:31.343Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

   101.4% (506.9ms out of 500ms) cpu usage by thread 'elasticsearch[elk-elasticsearchwh381-eqx-1][[transport_server_worker.default]][T#26]'
     10/10 snapshots sharing following 34 elements
       sun.nio.ch.FileDispatcherImpl.write0(Native Method)
       sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
       sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
       sun.nio.ch.IOUtil.write(IOUtil.java:51)
       sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
       io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:388)
       io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
       io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:362)
       io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:901)
       io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1321)
       io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
       io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
       io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
       io.netty.handler.ssl.SslHandler.forceFlush(SslHandler.java:1667)
       io.netty.handler.ssl.SslHandler.wrapAndFlush(SslHandler.java:735)
       io.netty.handler.ssl.SslHandler.flush(SslHandler.java:712)
       io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
       io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
       io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
       io.netty.handler.logging.LoggingHandler.flush(LoggingHandler.java:265)
       io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
       io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
       io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
       io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
       io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
       io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
       io.netty.channel.AbstractChannelHandlerContext.access$1500(AbstractChannelHandlerContext.java:38)
       io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1129)
       io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
       io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
       io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
       io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
       io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
       java.lang.Thread.run(Thread.java:745)

This seems weird, and suggests something wrong with the network subsystem on this machine. Maybe bad config, maybe a broken driver or firmware? Elasticsearch opens many concurrent transport connections (11-13 between each pair of nodes in each direction, and there are 2 nodes on this host, so it should be dealing with roughly 66 * 11 * 2 * 2 ≅ 3000 connections), which could well be enough to trigger some kind of issue.
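
If you want to sanity-check that estimate against what the host is actually handling, a quick sketch like the following (assuming the two local nodes bind the default transport ports 9300 and 9301, and that psutil is available) will count the established TCP connections touching those ports:

    # Sketch: count established TCP connections on the assumed transport ports.
    # Assumption: the two nodes on this host listen on 9300 and 9301 (defaults).
    import psutil

    TRANSPORT_PORTS = {9300, 9301}

    established = [
        c for c in psutil.net_connections(kind="tcp")
        if c.status == psutil.CONN_ESTABLISHED
        and (c.laddr.port in TRANSPORT_PORTS
             or (c.raddr and c.raddr.port in TRANSPORT_PORTS))
    ]
    print(f"established transport connections: {len(established)}")

If that number is wildly different from the ~3000 estimate, that would be interesting in itself.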

To be entirely honest, my first guess 6-8 weeks ago, or however long ago this started, was "networking." However, I haven't found anything anomalous with this box, and I have engaged network engineers at my company to confirm. Throughput capabilities are greater on this machine (and the other new one), and there don't appear to be any configuration differences. Additionally, we've verified there are no elevated TCP error rates at the local interfaces or at the ToR switch, MTU matches everywhere, manual throughput tests let us push far more data from one node to another than we ever see during normal operations, and there is no evidence of TCP port exhaustion. I was a network engineer in a former life, so I know some of what to check for, and I was at a loss in terms of networking issues when I started this thread, but I can try to dig deep again.
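
For reference, the interface-level part of that boils down to watching the per-NIC error and drop counters. A minimal sketch of pulling them (using psutil here as an assumption; the same counters are visible via ip -s link or ethtool -S) looks like this:

    # Sketch: dump per-interface error/drop counters to look for anything elevated.
    # Assumption: psutil is installed on the box.
    import psutil

    for nic, stats in psutil.net_io_counters(pernic=True).items():
        print(f"{nic}: errin={stats.errin} errout={stats.errout} "
              f"dropin={stats.dropin} dropout={stats.dropout}")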

That being said, I remember some kind of strangeness with the network config when I first built this box. I wish I could remember exactly what it was. I do remember it was just changing a line or two in /etc/sysconfig/network-scripts/ and it seemed so minor I didn't think twice about it. I might just re-image this machine at this point, unless someone else has any ideas.

I've had the networking and hardware teams examine everything they can think of, both at the hardware level and the system level, and they can find nothing wrong with the network subsystem or the network link on this server.

I have rebuilt this machine from scratch. It re-enters the cluster at midnight UTC.

It is also worth noting that all of the hot servers are now on jdk1.8.0_111, but I haven't had time to get to all of the warm servers yet.
