TL;DR: We added additional capacity to our Elasticsearch cluster, and indexing is now actually performing worse than before.
We have a fairly large ELK stack that runs across two data centers. Before I get to our exact problems, I want to give as much context as possible.
The basic breakdown of the setup is as follows:
- 18 Kafka nodes in each data center running on VMs.
- 50 Logstash indexers in each data center running on VMs.
- 3 Elasticsearch/nginx load balancers (Client-only) in each data center running on VMs.
- 3 Elasticsearch “controllers” (Master-only) in each data center running on VMs.
- 10 Elasticsearch “warm” servers (Data-only) in each data center running on bare metal with spinning platters.
- 4 Elasticsearch “hot” servers (Data-only) in each data center running on bare metal with SSDs.
So, 176 servers in total.
None of the VMs are overloaded or running too hot, and we have no reason to believe the problem lies there; it is almost certainly with the data-only nodes. I only list the other servers so people have the full context and know that we’ve essentially ruled them out.
The “warm” servers are all PowerEdge R530s with 256GB of RAM and 48 HyperThreaded cores. There are two disk layouts, each covering about half of the fleet in each data center: the machines originally built with RAID5 have 17TB of available storage across 4 x 7200rpm disks, while the roughly half we’ve since rebuilt on RAID0 for better write throughput have the exact same disks but about 22TB of available storage.
The original “hot” servers are all PowerEdge R730s with 256GB of RAM, 56 HyperThreaded cores and 7TB of available storage on RAID0 built upon SSDs.
Recently we added 2 new “hot” servers, 1 to each data center. These machines are actually even stronger than the older “hot” servers. They’re also R730s, and they also have 256GB of RAM, but the memory runs at 2400MHz instead of 2133MHz. They have 64 HyperThreaded cores instead of 56. They have the same amount of storage, but on a SAS bus instead of a SATA bus. They have 10Gb uplinks instead of 1Gb uplinks. In short, they’re just slightly better machines in every way.
All machines run the same OS, with only patch-level differences in kernel versions. All machines have the same packages and library versions installed, excluding perhaps a minor troubleshooting tool installed here or there. All machines have essentially the exact same system configuration, as demonstrated by diffing sysctl -a output. No machine ever saturates its network links, and there is no significant number of TCP errors at either the servers or the switches.
All of our data-only servers run two Elasticsearch nodes, each configured with a 30GB heap, per the official guidance. Also per that guidance, the Elasticsearch nodes on the “warm” servers are configured with 1 Lucene merge thread while the “hot” servers are configured with 4, and all nodes average below 600 shards, with the “hot” nodes averaging only about 110 shards per node when all servers are accepting writes. All indices are sharded properly, and we have no shards larger than around 30GB at the most. We are not even close to running out of disk on any of the servers. The “hot” tier as a whole processes about 5TB of data per day, which represents our busiest indices. After 3 days, these indices are moved to the “warm” tier, which also handles about 2.7TB of incoming data from lower-trafficked indices every day.
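For concreteness, the merge thread counts above boil down to the index-level index.merge.scheduler.max_thread_count setting; a minimal sketch of applying it dynamically would look something like this (the endpoint and index patterns are placeholders, not our real names):

```python
# Rough sketch (placeholder names): apply the merge thread counts described above
# via the dynamic index.merge.scheduler.max_thread_count setting -- 4 concurrent
# merges on the SSD-backed "hot" indices, 1 on the spinning-disk "warm" indices.
import requests

ES = "http://es-loadbalancer:9200"  # hypothetical client-node endpoint

def set_merge_threads(index_pattern, threads):
    """PUT index.merge.scheduler.max_thread_count on every index matching the pattern."""
    resp = requests.put(
        f"{ES}/{index_pattern}/_settings",
        json={"index.merge.scheduler.max_thread_count": threads},
    )
    resp.raise_for_status()

set_merge_threads("logs-hot-*", 4)   # SSD-backed indices
set_merge_threads("logs-warm-*", 1)  # spinning-disk indices
```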
Everything was running fine, processing an average of 85K log lines per second essentially all day every day, with some natural variance going up to around 100K per second and random spikes thrown in. With this setup we appear to be able to sustain about 170K log lines per second when needed.
So, finally, the problem:
When we introduced the two newer “hot” servers, we started noticing periodic instability in our indexing rate. When things are good, our indexing almost perfectly matches our Kafka incoming messages. When things are bad, the indexing rate becomes jittery with very large spikes and dips. Additionally, when things are good we are able to weather very heavy queries without a problem, but when we’re in a bad state, very heavy queries can slow indexing to a crawl until we recover. During bad periods that last long enough, we can actually start to back up on processing to the point that things are no longer near-real-time, sometimes reaching a delay between message received and message indexed of 5-20 minutes. When things are good this delay is about 2 seconds. Luckily, these bad periods always eventually turn back into good periods, and we catch up very quickly.
It should also be noted that when we’re in a bad state, the load average on one or both of the new servers spikes well above normal. The hot servers all average a load of around 7-9 during normal operations, but the load on the new hot servers will spike to 15+ for extended periods during a bad period. Nothing besides the two Elasticsearch JVMs is ever eating a significant portion of CPU. This still shouldn’t really be a problem given how many cores these machines have, but it’s a symptom worth noting, since it’s unique to these bad periods.
When we stop routing incoming logs to the newer “hot” servers, these problems all go away.
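To be clear about what “stop routing” means here, the effect is roughly what the shard allocation filtering below would do (a simplified sketch with placeholder node names; our exact mechanism targets only the current write indices, whereas a cluster-level exclude also relocates existing shards):

```python
# Simplified illustration (placeholder node names): exclude the two new nodes
# from shard allocation by name via a transient cluster setting.
import requests

ES = "http://es-loadbalancer:9200"  # hypothetical client-node endpoint

def exclude_nodes(node_names):
    """Transiently exclude the named nodes from shard allocation cluster-wide."""
    resp = requests.put(
        f"{ES}/_cluster/settings",
        json={"transient": {"cluster.routing.allocation.exclude._name": ",".join(node_names)}},
    )
    resp.raise_for_status()

exclude_nodes(["dc1-hot-new-a", "dc1-hot-new-b"])
```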
We have not been able to identify any consistent pattern in when these bad periods happen. Sometimes the jitter lasts for 5 minutes and sometimes for hours on end. When we check the hot threads output on the “hot” servers during this time, it’s always Lucene merges, but it’s also always Lucene merges when things are good. There is no significant shard imbalance on the newer machines. We have diffed every config on these servers and have found zero differences. We are using CMS for garbage collection, and there is no correlated change in GC times when things are bad.
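For reference, the hot threads check is nothing fancy, essentially the following against the nodes API (placeholder endpoint and node names):

```python
# Roughly how we check hot threads on the new nodes (placeholder names): pull the
# hot_threads dump per node and count how many of the hottest threads are Lucene merges.
import requests

ES = "http://es-loadbalancer:9200"  # hypothetical client-node endpoint

def hot_threads(node_name, threads=5, snapshots=3):
    """Return the plain-text hot_threads dump for one node."""
    resp = requests.get(
        f"{ES}/_nodes/{node_name}/hot_threads",
        params={"threads": threads, "snapshots": snapshots},
    )
    resp.raise_for_status()
    return resp.text

for node in ("dc1-hot-new-a", "dc1-hot-new-b"):
    dump = hot_threads(node)
    merges = sum("Lucene Merge Thread" in line for line in dump.splitlines())
    print(f"{node}: {merges} of the hottest threads are Lucene merges")
```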
As a long shot, we lowered the number of cores each Elasticsearch node on the newer machines knows about (half the machine’s cores, since each box runs two nodes) to match the older “hot” servers, going from 32 to 28 cores per JVM. We’re still seeing these periods of high jitter and indexing problems, and when we move incoming logs away from the new servers, the problems go away entirely.
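The usual knob for that is the processors setting in elasticsearch.yml (node.processors in newer versions), and a quick way to confirm what each node actually reports is something like the sketch below (placeholder endpoint; depending on the Elasticsearch version, the cap shows up as allocated_processors rather than available_processors):

```python
# Quick sanity check (a sketch, not our production tooling): print the processor
# counts each node reports via the nodes info API, to confirm the cap took effect.
import requests

ES = "http://es-loadbalancer:9200"  # hypothetical client-node endpoint

resp = requests.get(f"{ES}/_nodes/os")
resp.raise_for_status()

for node_id, info in resp.json()["nodes"].items():
    os_info = info.get("os", {})
    print(info["name"],
          "available:", os_info.get("available_processors"),
          "allocated:", os_info.get("allocated_processors"))
```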
Any ideas? It seems so counterintuitive that introducing additional hardware to the cluster, much less better hardware, would cause problems, but that’s where we’re at. We consider ourselves fairly well versed in managing an ELK stack and no strangers to managing hardware. We have a whole slew of ideas for config tweaks that might improve overall performance, but none of them would explain why adding new, potentially overpowered machines makes things worse as things currently stand.
Thanks in advance!