After almost a week of perfect stability, suddenly the jitter has returned with a corresponding spike in load on wh381, one of the new servers and the one that always seems to be acting up the most. Grabbed hot-threads output.
For one of the Elasticsearch nodes running on the machine it's all transport_client_boss threads, and on the other it's transport_server_worker.default.
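For reference, this kind of output comes from the nodes hot threads API. Here is a minimal sketch of pulling it over HTTP in Python, assuming a node reachable on localhost:9200; the host and the commented-out credentials are placeholders and will differ if security is enabled.

import requests

# Nodes hot threads API; parameters mirror the header in the dump further down
# (interval=500ms, busiestThreads=3, ignoreIdleThreads=true).
resp = requests.get(
    "http://localhost:9200/_nodes/hot_threads",
    params={"threads": 3, "interval": "500ms", "ignore_idle_threads": "true"},
    # auth=("elastic", "<password>"),  # placeholder; only needed if security is enabled
)
resp.raise_for_status()
print(resp.text)  # plain-text report, one section per node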
So, yeah. We're back to square one. I now believe the JDK upgrade didn't actually fix anything; the apparent fix was really just restarting the Elasticsearch processes, and once they had been up for ~6 days again, the same problems reappeared. Indexing across all of the hot servers was again terrible this morning, and the very minute we shut down the Elasticsearch processes on wh381 everything returned to being essentially perfect.
So, we run two Elasticsearch processes on each server. The engineer that took down Elasticsearch on wh381 accidentally only took down one of the two. So, we went from the entire cluster unable to index properly to everything being perfect within about a minute of just one of the two Elasticsearch processes on this server going down.
Can you share the hot threads output, and identify which nodes are which in it? Specifically, which are the two pairs of nodes running on the new hosts and which one was the one that was shut down?
Sure. It's way too big to post in a comment, assuming you'd want to see hot threads output for more than just the problem server, since this impacts indexing on every node. Can you provide another upload link?
Done and done. The first part of the file is a hot-threads dump from when we were looking bad. You'll notice there is both an elk-elasticsearchwh381-eqx-1 and an elk-elasticsearchwh381-eqx-2.
Then there are some blank lines, and I separated the next output with ---. What follows is a hot-threads dump from just a few minutes ago where everything looks great. Only elk-elasticsearchwh381-eqx-2 is present from that server in that one.
I'm tempted to not care anymore. Everything is so perfect right now, including a few lengthy spikes to almost double traffic where indexing tracked perfectly, that I might just say "This is the weird server" and leave it running only one node. Unless problems return, that is.
The engineer in me still wants a root cause, so if anyone has ideas I'd love to hear them, but I cannot justify much more of my own time. Between myself and the other engineers who have been trying to figure this out for almost two months, the combined salary hours spent on this are starting to be pretty ridiculous.
In the "slow" hot threads dump, both of the elk-elasticsearchwh381-eqx nodes are at 100% CPU on the three listed threads in a call to the OS to write to the network for the full 500ms of sampling:
::: {elk-elasticsearchwh381-eqx-1}{QrECOldSSzOUXsK_X61hLg}{XBKzBPUtRi-C7Xtcgl7kRA}{10.122.40.190}{10.122.40.190:9300}{xpack.installed=true, box_type=hot, zone=eqx_dal, gateway=false}
Hot threads at 2019-07-17T15:16:31.343Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
101.4% (506.9ms out of 500ms) cpu usage by thread 'elasticsearch[elk-elasticsearchwh381-eqx-1][[transport_server_worker.default]][T#26]'
10/10 snapshots sharing following 34 elements
sun.nio.ch.FileDispatcherImpl.write0(Native Method)
sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
sun.nio.ch.IOUtil.write(IOUtil.java:51)
sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:388)
io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:362)
io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:901)
io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1321)
io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
io.netty.handler.ssl.SslHandler.forceFlush(SslHandler.java:1667)
io.netty.handler.ssl.SslHandler.wrapAndFlush(SslHandler.java:735)
io.netty.handler.ssl.SslHandler.flush(SslHandler.java:712)
io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
io.netty.handler.logging.LoggingHandler.flush(LoggingHandler.java:265)
io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
io.netty.channel.AbstractChannelHandlerContext.access$1500(AbstractChannelHandlerContext.java:38)
io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1129)
io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
java.lang.Thread.run(Thread.java:745)
This seems weird, and suggests something wrong with the network subsystem on this machine. Maybe bad config, maybe a broken driver or firmware? Elasticsearch opens many concurrent connections (11-13 between each pair of nodes in each direction, and there are 2 nodes on this host, so it should be dealing with 66 * 11 * 2 * 2 ≈ 3000 connections), which could well be enough to trigger some kind of issue.
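If it helps to check that estimate against reality, here is a rough sketch for counting established transport connections on the host, assuming psutil is available and the two local nodes listen on transport ports 9300/9301 (those ports are an assumption; adjust to the actual transport ports).

import psutil

# Count established TCP connections where either endpoint is a transport port.
# May need root to see sockets belonging to other users' processes.
TRANSPORT_PORTS = {9300, 9301}  # assumption: one transport port per local node

conns = [
    c for c in psutil.net_connections(kind="tcp")
    if c.status == psutil.CONN_ESTABLISHED
    and (c.laddr.port in TRANSPORT_PORTS
         or (c.raddr and c.raddr.port in TRANSPORT_PORTS))
]
print(f"established transport connections: {len(conns)}")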
To be entirely honest, my first guess 6-8 weeks ago, or however long ago this started, was "networking." However, I haven't found anything anomalous with this box, and have engaged network engineers at my company to confirm. Throughput capabilities are greater on this machine (and the other new one) and there don't appear to be any configuration differences. Additionally, we've verified there are no elevated TCP error rates at the local interfaces or at the ToR switch. MTU matches everywhere, manual throughput tests let us send much more data from one node to another than ever moves during normal operations, and there is no evidence of TCP port exhaustion. I was a network engineer in a former life, so I know some of what to check for, and I was at a loss in terms of networking issues when I started this thread, but I can try to dig deep again.
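Along those lines, here is a quick sketch of the kind of per-interface error and TCP retransmit check involved, assuming a Linux host; the interface name "eth0" is a placeholder for the actual uplink interface.

from pathlib import Path

IFACE = "eth0"  # placeholder; substitute the real uplink interface

# Per-interface error/drop counters exposed by the kernel.
stats = Path(f"/sys/class/net/{IFACE}/statistics")
for counter in ("rx_errors", "tx_errors", "rx_dropped", "tx_dropped"):
    print(counter, (stats / counter).read_text().strip())

# TCP retransmissions from the kernel's SNMP counters.
tcp_lines = [l.split() for l in Path("/proc/net/snmp").read_text().splitlines()
             if l.startswith("Tcp:")]
header, values = tcp_lines[0], tcp_lines[1]
print("RetransSegs:", values[header.index("RetransSegs")])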
That being said, I remember some kind of strangeness with the network config when I first built this box. I wish I could remember exactly what it was. I do remember it was just changing a line or two in /etc/sysconfig/network-scripts/, and it seemed so minor I didn't think twice about it. I might just re-image this machine at this point, unless someone else has any ideas.
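If it's useful before re-imaging, here is a tiny sketch for dumping the non-comment settings from every ifcfg-* file so they can be diffed against the other new host; the path assumes the RHEL/CentOS-style layout mentioned above.

from pathlib import Path

# Print key=value settings from each interface config so two hosts can be diffed.
for cfg in sorted(Path("/etc/sysconfig/network-scripts").glob("ifcfg-*")):
    print(f"== {cfg.name} ==")
    for line in cfg.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            print("  " + line)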
I've had the networking team and hardware teams examine everything they can think of, both at the hardware level and the system level, and they can find nothing wrong with the network subsystems or the network link on this server.