Indexing Becomes Inconsistent and Problematic With Additional Capacity Added

Tl;dr - We have added additional capacity to our Elasticsearch cluster and now indexing is actually performing worse than before.

We have a fairly large ELK stack that runs across two data centers. Before I get to our exact problems, I want to give as much context as possible.

The basic breakdown of the setup is as follows:

  • 18 Kafka nodes in each data center running on VMs.
  • 50 Logstash indexers in each data center running on VMs.
  • 3 Elasticsearch/nginx load balancers (Client-only) in each data center running on VMs.
  • 3 Elasticsearch “controllers” (Master-only) in each data center running on VMS.
  • 10 Elasticsearch “warm” servers (Data-only) in each data center running on bare metal with spinning platters.
  • 4 Elasticsearch “hot” servers (Data-only) in each data center running on bare metal with SSDs.

So, 176 servers in total.

None of the VMs are overloaded or running too hot and we have no reason to believe the problems lie there. The problem is almost certainly with the data-only nodes. I only mention all of our other servers involved so people have context and know that we’ve almost certainly ruled out any problems there.

The “warm” servers are all PowerEdge R530s with 256GB of RAM and 48 HyperThreaded cores. Their disk layouts have two possibilities, and each represents about half of the fleet in each data center. Originally these machines were built with RAID5, and those servers have 17TB of available storage running off of 4 x 7200rpm disks. We’ve rebuilt just about half of these machines on top of RAID0 for increased write speed, and those servers have the exact same disks, but about 22TB of available storage.

The original “hot” servers are all PowerEdge R730s with 256GB of RAM, 56 HyperThreaded cores and 7TB of available storage on RAID0 built upon SSDs.

Recently we added 2 new “hot” servers, 1 to each data center. These machines are actually even stronger than the older “hot” servers. They’re also R730s, and they also have 256GB of RAM, but memory runs at 2400Mhz instead of 2133Mhz. They have 64 HyperThreaded cores instead of 56. They have the same amount of storage, but on a SAS bus instead of a SATA bus. They have 10Gb uplinks instead of 1Gb uplinks. In short, they’re just slightly better machines in every way.

All machines have the same OS, with only patch-level version difference in kernels. All machines have the same packages and library versions installed, excluding perhaps a minor troubleshooting tool installed here or there. All machines have essentially the exact same system configuration as demonstrated by diffing sysctl -a output. No machines ever saturate their network links. There are no signifcant amount of TCP errors at either the servers or the switches.

All of our data-only servers run two Elasticsearch nodes, each configured with a 30GB heap, per guidance. The Elasticsearch nodes on the “warm” servers have been configured with 1 Lucene merge thread while the “hot” servers have been configured with 4, as per the official guidance. All nodes average below 600 shards, also per the official guidance, and the “hot” nodes average only about 110 shards per node, when all servers are accepting writes. All indices are sharded properly, and we have no shards larger than around 30GB at the most. We are not even close to running out of disk on any of the servers. The “hot” cluster as a whole processes about 5TB of data per day, which represents our busiest indices. After 3 days, these indices are moved to the “warm” cluster, which also handles about 2.7TB of incoming data from lower-trafficked indices every day.

Everything was running fine, processing an average of 85K log lines per second essentially all day every day, with some natural variance going up to around 100K per second and random spikes thrown in. With this setup we appear to be able to consistently process about 170K log lines per second at a sustained rate, when needed.

So, finally, the problem:

When we introduced the two newer “hot” servers, we started noticing periodic instability in our indexing rate. When things are good, our indexing almost perfectly matches our Kafka incoming messages. When things are bad, the indexing rate becomes jittery with very large spikes and dips. Additionally, when things are good we are able to weather very heavy queries without a problem, but when we’re in a bad state, very heavy queries can slow indexing to a crawl until we recover. During bad periods that last long enough, we can actually start to back up on processing to the point that things are no longer near-real-time, sometimes reaching a delay between message received and message indexed of 5-20 minutes. When things are good this delay is about 2 seconds. Luckily, these bad periods always eventually turn back into good periods, and we catch up very quickly.

It should also be noted that when we’re in a bad state, the load average on one or both of the new servers spikes well above average. The hot servers all average a load of around 7-9 during normal operations, but the load on the new hot servers will spike to 15+ for extended periods of time when we’re in a bad period. Nothing is ever eating up a significant portion of CPU besides the two Elasticsearch JVMs. This still shouldn’t really be a problem with how many cores they have available to them, but it’s a symptom worth noting, since it’s unique to these bad periods.

When we remove the newer “hot” servers from the cluster in terms of incoming logs, these problems all go away.

We have not been able to identify any consistent pattern in when these bad periods happen. Sometimes the jitter lasts for 5 minutes and sometimes for hours on end. When we check hot threads output on the “hot” servers during this time, it’s always Lucene merges, but it’s also always Lucene merges when things are good. There is no significant shard imbalance happening on the newer machines. We have diffed every config on these servers and have found zero differences. We are using CMS for garbage collection, and there is no correlating change in GC time when things are bad.

On a long shot, we lowered the number of cores each Elasticsearch node knows about (which would be half the cores) on the newer machines to match the older “hot” servers (from 32 to 28 cores per JVM). We’re still seeing these periods of high jitter and indexing problems. When we move incoming logs away from the new servers, the problems go away entirely.

Any ideas? It seems so counterintuitive that introducing any additional hardware to the cluster, much less better hardware, would cause problems, but that’s where we’re at. We consider ourselves fairly well versed in managing an ELK stack and no stranger to managing hardware. We have a whole slew of ideas about config tweaks that might improve overall performance, but none of those would help explain why adding new, potentially overpowered machines would make things worse as things currently stand.

Thanks in advance!

Other things I should have mentioned:

  • We're running 6.3.2
  • Lucene version is 7.3.1
  • We have no been able to spot any change in log message types in logstash logs during "bad" periods
  • There are no master elections taking place randomly, only when we ask for one to happen
  • This has been going on for about 8 weeks

Thanks for the detailed writeup. It helps to rule out a lot of common issues. I fear that debugging a lot further might take more time and effort than we can reasonably give here on the community forums. It sounds very much related to the two new machines rather than something within Elasticsearch itself.

You've covered CPU, GC, and network usage, but a mention of IO profiling is conspicuously absent. I think IO performance issues might reasonably explain the effect you're seeing, so that's what I'd suggest to look at next. Does IO perform worse in the bad periods than the good? Is there more IO? How does the IO profile compare on the new nodes to the old nodes? Are the new nodes doing more IO than the old ones? Do you see any notable differences in IO when the new nodes are excluded? If you simulate Elasticsearch's IO load using something like fio can you find a measurable difference in performance between the two classes of node?

Hello @Sometimes

Elasticsearch logs are clean?

what's the latency between the new machines across the datacenters?

Disk IO looks fine at all times. These are SSDs that are never taxed. There is not noticeable difference during bad time and good times. Total IO by each node depends on shard balance on any given day.

Yup. Nothing in the logs. A handful of GC overhead > 250ms warnings, but I really mean only a few here or there and nothing that correlates to good times vs bad times.

We just decreased our refresh interval from 58s to 20s to 1s, as per advice from someone else. This has raised load and GC time across all nodes, but doesn't appear to have fixed or made indexing rate worse.

Oh, additionally, we turned down half of our indexers, as someone else noted we probably didn't need that many and it could be causing connection overload. So, we're down to 25 in each datacenter, have seen no change in indexing rate and actually saw the load on the load balancers drop.

Thought it might be useful for people to see what I mean with "indexing jitter". Below is a graph showing what things look like when they're "bad." When things are good, they look exactly like what the first ~5 minutes of this graph looks like -- indexing rate tracks incoming messages almost perfectly.

This morning at one point we were seeing dips down to only 5K/s indexed. Nothing in the logstash logs, nothing in the elasticsearch logs. transport_client_boss threads eating up all the CPU on the problematic nodes.

Can you clarify "fine"? For instance, are IO queues essentially empty on every node?

Hmm. What are they doing? How busy are they? I perhaps misinterpreted your original post as meaning "nothing is ever eating up a significant portion of CPU". The nodes hot threads API will give you some stack traces that it'd be useful to share here. Use GET /_nodes/hot_threads?threads=9999 to show us a full picture.

Do the JVM versions match on all the nodes? What is the JVM are you using, and what version is it?

Yeah, sorry. I made a mistake not being clear enough about CPU usage. Sorry, it's a lot of state to try and translate to words. :smiley:

In terms of "Disk IO looks fine" the mean across all of our SSD nodes for the day so far (UTC) according to iostat-based metrics is:

  • 9.3 %user
  • 0.0 %nice
  • 2.3 %system
  • 0.04 %wait
  • 0.0 %steal
  • 88.3 %idle

I'll grab you hot threads output in a little bit. Got a few meetings in the way first.

What's the preferred way to get you the hot_threads output? It's 25800 lines and 2.2M

I took the two newer, more powerful machines out of the cluster again because things were bad, and everything looks perfect again. This is really frustrating. :neutral_face:

You can use this link to upload the hot threads dump: https://upload-staging.elstc.co/u/15b0b580-96bd-4da1-994b-a1a4845f542f. Could you also share a hot threads dump from when the cluster isn't suffering from the problem, but is under similar load?

Also it'd be good to know about the JVMs:

We're running the following on all nodes:

java version "1.8.0_102"
Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)

Which, checking the support matrix, is technically barely out of date for support of 6.3 -- I guess I could try upgrading to JDK10 on just one of the problem nodes and bring it back in.

Also, uploaded hot_threads output from during a bad period. I can run one in a few since things look good, but not sure you'll get the same insight since the problem nodes are currently down.

https://upload-staging.elstc.co/d/9f085a00e4ba3376dcdf1958aa5adf4011f89bf15242c2f710ebc0c364418b74

Yes, JVMs prior to 1.8.0u111 apparently have issues. It's worth upgrading that before trying anything else.

Alright, one of the problem servers is now running both of it's nodes on 1.8.0u111. It'll start processing incoming logs at midnight UTC tonight. We'll see how it goes.

So, I updated to jdk1.8.0_111 on one of the problematic nodes, and gave it ~half of the shards it normally gets, and it has been fine all day. It will be starting off with a normal daily load of shards at midnight UTC tonight. The only problems I've seen appear to be related to the other problem node, but it's currently in our secondary data center which does orders of magnitude less traffic.

Will have a lot more data this time tomorrow, but I'm cautiously optimistic so far.

1 Like

10

I feel pretty good about these numbers. There is natural variance in the incoming logs of course, so we're going to see movement on indexing rate, by this pretty much proves things have smoothed out considerably.

Thanks, David! Can't believe it ended up being just an upgrade from 102 to 111 on a single node, but what can you do?. I'll be updating all the rest as soon as I can.

1 Like

Cool, thanks for letting us know.