Newbie performance troubleshooting, high load spikes on ES nodes

Hi All,

I'm running a 3-node RELK cluster that's mostly storing application logs coming in via a syslog listener (logstash shipper). We have ~1800 indices, ranging from a few thousand docs/events and trivial size up to a handful (not many) with 1-2GB primary size and 3-4M docs on days when something was puking errors.

We're set for one replica and one shard per index, which was consistent with best practices when this cluster was built, since it's ostensibly mostly an archive; I suspect our devs are querying the data more than we planned on, though. I recently installed X-Pack monitoring and it shows search rates around 25/s with occasional spikes just under 200/s, and indexing rates no higher than ~450/s.

The nodes are all linux VMs with a separate VHD for ES data, all served from an NFS datastore. I know that's not best practice but also that our application is tiny compared to the scale many shops run ELK at. Every metric I can find on the VMs shows no significant waits for disk IO.

Here's the problem: our mostly idle nodes regularly spike to short but crazy load averages, up to 375 for a minute or two before dropping back below 1. During these spikes, latency jumps into the seconds, kibana times out, and our nagios alerts for load averages all go off. Sometimes there's a load spike at the same time as a search spike, but often a load spike has no search/index spike at the same time.

I found a reference on this forum to ES performance problems with OpenJDK; we're running Oracle Java.

I've found other references that generally say if you're having ES performance problems, throw another node on the pile. I could also try increasing my shard count, merging my old indices to lower the index count, or moving those data disks to local storage on their hosts. But I'd like to have better proof of the problem before I resort to trial and error.

I'd be grateful for any guidance on troubleshooting and metrics...

Randy in Seattle

Forgot to add this. Node metrics in kibana/monitoring show this for all three ES nodes (screenshot of the JVM heap chart, a sawtooth that climbs and then drops back down about every 21 minutes):

Is this normal, some process loading and flushing cache, for example? Or is something crashing and reloading every 21 minutes like clockwork? /var/log/elasticsearch/CLUSTER.log shows nothing; would a problem be logged somewhere else?

No worries about OpenJDK. I'm not aware of any outstanding performance issues.

That heap chart looks very healthy actually. What happens is that the JVM produces garbage at a fairly regular rate in your cluster (leftover objects from indexing, search, etc). The garbage slowly begins to fill up the heap, but you have plenty of heap space and the jvm knows it's garbage... so it just lets it accumulate.

At some point (we configure it to be 70% of the heap), a garbage collection is initiated which quickly collects all the garbage and returns the heap to its "in-use" size, which looks to be around 2.8gb. The fast drop is a good sign; it means the JVM was able to quickly collect all the garbage without much work.

Hard to say really, I think you'll have to go digging for more stats. Some ideas:

  • Is it disk IO related? E.g. is your IO capped out during a big merge or something?
  • Are there shards moving around during the spike?
  • Are there any pending tasks on the master during the spike? Do you see a big GC event during the spike?
  • Anything interesting in hot_threads during the spike? (There's a quick sketch below for grabbing all of these while a spike is in progress.)
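
If it helps, here's a rough sketch of a way to capture all of those in one shot while a spike is happening. It assumes plain HTTP on localhost:9200 with no security in front of it and uses the Python requests library; adjust the URL/auth for your setup.

```python
# Rough spike sampler: run while a load spike is happening, then diff the
# output files afterwards. Assumes plain HTTP on localhost:9200 with no
# auth (adjust ES, or add auth, for your environment).
import datetime
import json
import requests

ES = "http://localhost:9200"

def snapshot():
    stamp = datetime.datetime.now().strftime("%Y%m%dT%H%M%S")

    # Hot threads: what each node is actually burning CPU on right now.
    hot = requests.get(f"{ES}/_nodes/hot_threads").text

    # Pending cluster tasks: anything queued up on the master?
    pending = requests.get(f"{ES}/_cluster/pending_tasks").json()

    # JVM stats: compare GC counts/times between two snapshots to spot a
    # big GC event that coincides with the spike.
    jvm = requests.get(f"{ES}/_nodes/stats/jvm").json()

    # Shard table: are shards relocating/initializing during the spike?
    shards = requests.get(f"{ES}/_cat/shards?v").text

    with open(f"spike-{stamp}.txt", "w") as f:
        f.write(hot)
        f.write("\npending tasks:\n" + json.dumps(pending, indent=2))
        f.write("\njvm stats:\n" + json.dumps(jvm, indent=2))
        f.write("\n" + shards)

if __name__ == "__main__":
    snapshot()
```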

If I had to guess, I'd wager this is at least part of the problem (if not the problem). NFS is crazy unreliable. There's the network latency, which makes life difficult for databases that expect IO to be fast (or at least, faster than network). Even high performance SANs have difficulties with this.

Then there are the NFS semantics themselves, which require round trips for every operation. Combined with network latency, simple disk IO just becomes intolerably slow.

It's also possible to configure NFS servers/clients in such a way that makes synchronization semantics... interesting... for datastores. NFS can introduce caching, delay syncs, say it synced but didn't really, etc. I don't know enough about NFS unfortunately to provide deep insight here, just things I've heard through the grapevine from more capable developers.

You may also need to tweak how your NFS is mounted, iirc rsize/wsize can have a big impact on performance.

This is not an optimal setup either, especially with only three nodes. Each shard carries a certain amount of overhead, so with that many indices/shards you're burning a lot of resources on overhead alone. Shards can hold quite a bit of data; there's no single answer, but for example I don't blink at throwing 10-50m docs into an index on my laptop.

I'd revisit your index strategy and try to combine the data into fewer indices. That'll make for a happier cluster (fewer shards to manage), better compression, and a happier master node (smaller changes to the cluster state when something changes).
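
As a rough sketch of what combining could look like via the Reindex API (the index names here are placeholders; use whatever daily pattern your shipper actually writes, and localhost:9200 with no auth is assumed):

```python
# Sketch: roll a month of daily indices into one monthly index via the
# Reindex API. The index names ("applog-2017.06.*" -> "applog-2017.06")
# are placeholders -- use whatever pattern your logstash shipper writes.
# Assumes localhost:9200, no auth.
import requests

ES = "http://localhost:9200"

resp = requests.post(
    f"{ES}/_reindex",
    params={"wait_for_completion": "false"},       # run as a background task
    json={
        "source": {"index": "applog-2017.06.*"},   # all the daily indices for June
        "dest": {"index": "applog-2017.06"},       # the combined monthly index
    },
)
print(resp.json())  # contains a task id you can poll with the Tasks API
# Delete the daily indices once the doc counts in the new index check out.
```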


Hi polyfractal, thank you for your reply,

I've moved the Elasticsearch data volume to local storage on the hosts holding the VMs. It means we can't migrate the VMs around as easily, but with enough ES nodes we can afford to take one down to move it if needed.

Two nodes have now gone 18 hours without a load spike; one still squawked in the middle of the night, but its load average was 2, not 350. I can live with that. Kibana still breaks periodically with a "bad gateway" error that clears on reloading the page.

I'm working on merging indices, probably boiling each month together. What's a generally prudent maximum for doc count and primary size for a single index, assuming a three node cluster and one shard?

Or should we add nodes and/or up the shard count closer to the default? Now would be the time to do that as I reindex/merge.

Lastly (thank you for your patience as I veer off topic; I'm hoping to get as much help as I can while I have your attention), we're considering adding our LDAP logging to this stack, which will add up to ~15M events per day, about 4X our busiest days so far. We'll definitely add nodes. Should we also revisit the shard count? Again, this will likely be mostly write-only and queried only as needed for authentication troubleshooting and compliance audits.

Hope to hear from you,

Randy

Good news! I guess it was the attached storage then. FWIW, there are lots of people who do manage to get SANs to work... but it takes a fair amount of work to make sure they are 100% rock steady. In my experience it's often less work to just deal with local storage and handle the complexity of migration/backups/etc from local storage.

Unsure about Kibana, could possibly be long query timeouts (iirc Kibana has a 30-60s timeout or something, but not an expert in this area).

Hard to say; it's pretty dependent on environment, hardware, types of queries, etc. We generally recommend working up some benchmarks with your data and hardware to determine what makes sense. E.g. set up some SLAs (10,000 docs/s and no greater than 200ms latency, or whatever), pump data into the cluster, and periodically check to make sure the SLAs aren't breached.
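
For example, a bare-bones probe along these lines (Python + requests, placeholder index name, no auth assumed) can at least put a number on raw bulk-indexing throughput before you layer real mappings and query load on top:

```python
# Bare-bones throughput probe: bulk-index synthetic docs and report docs/s.
# This is nowhere near a real benchmark (no realistic mappings, no query
# load, single client), but it's enough to sanity-check an SLA like
# "N docs/s at < X ms latency". Assumes localhost:9200 with no auth;
# "bench-test" is a throwaway index name. Pre-7.x clusters also need a
# "_type" entry in the action line.
import json
import time
import requests

ES = "http://localhost:9200"
INDEX = "bench-test"
BATCH = 1000      # docs per bulk request
BATCHES = 50      # total = 50,000 docs

def bulk_body(n):
    lines = []
    for i in range(n):
        lines.append(json.dumps({"index": {"_index": INDEX}}))
        lines.append(json.dumps({"message": f"synthetic event {i}", "level": "INFO"}))
    return "\n".join(lines) + "\n"

start = time.time()
for _ in range(BATCHES):
    requests.post(
        f"{ES}/_bulk",
        data=bulk_body(BATCH),
        headers={"Content-Type": "application/x-ndjson"},
    )
elapsed = time.time() - start
print(f"{BATCH * BATCHES} docs in {elapsed:.1f}s = {BATCH * BATCHES / elapsed:.0f} docs/s")
```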

Here's an article that touches on some of the considerations around shard sizing: How many shards should I have in my Elasticsearch cluster? | Elastic Blog

Generally, you'll hear sizes of 10-50gb per shard for time-based data. It's highly variable, but a reasonable ballpark to give you an idea.

An option to help manage the shards is the Rollover API which lets ES manage the rolling process based on criteria (like total size) rather than arbitrary time-limits. That can help if data is bursty. E.g. a slow day might only have a few docs, but rolling daily will create a new index regardless. Rolling based on size makes it so new indices are only created when needed.
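
A minimal sketch of what that looks like (the alias and index names are placeholders; max_docs and max_age are the long-standing conditions, while the max_size condition needs a reasonably recent ES release):

```python
# Sketch of size-based rollover; the alias and index names are placeholders.
# Assumes localhost:9200, no auth.
import requests

ES = "http://localhost:9200"

# Bootstrap: first index in the series, with a write alias pointing at it.
requests.put(f"{ES}/applog-000001", json={"aliases": {"applog-write": {}}})

# Ship everything to the alias, then periodically ask ES to roll over.
resp = requests.post(
    f"{ES}/applog-write/_rollover",
    json={"conditions": {"max_docs": 5000000, "max_size": "20gb", "max_age": "30d"}},
)
print(resp.json())  # shows which conditions matched and the new index name
```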

Some folks will also run a setup where they have many primary shards (3 in your case, one for each node) to maximize indexing performance by splitting the IO across all the nodes. Then once the time frame is "done" (e.g. the next day for daily indices), they shrink the index down to a single shard using the Shrink API, which cuts the shard count, reduces overhead, and generally improves query performance.
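
Roughly, the shrink dance looks like this (index and node names are placeholders, and localhost:9200 with no auth is assumed):

```python
# Sketch of shrinking yesterday's multi-shard index down to one shard.
# Index and node names are placeholders; assumes localhost:9200, no auth.
import requests

ES = "http://localhost:9200"
SRC = "applog-2017.06.01"
DEST = "applog-2017.06.01-shrunk"

# 1. Shrink needs a read-only source with a copy of every shard on one node.
requests.put(f"{ES}/{SRC}/_settings", json={
    "settings": {
        "index.blocks.write": True,
        "index.routing.allocation.require._name": "es-node-1",  # placeholder node
    }
})

# 2. Once the shards have relocated, shrink into a single-shard index and
#    clear the allocation pin on the new index.
resp = requests.post(f"{ES}/{SRC}/_shrink/{DEST}", json={
    "settings": {
        "index.number_of_shards": 1,
        "index.number_of_replicas": 1,
        "index.routing.allocation.require._name": None,
    }
})
print(resp.json())
```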

I wouldn't worry about the shard default... that number is largely historical, just a default to help people scale easily. Now that we have the ability to split shards it's even less important, since you can split a shard up in the future if needed. We even have a PR open right now to drop the default shard count to 1.

Adding nodes is trickier, it comes back to those benchmarks. If you can't meet your SLA you may have to add more horsepower to the cluster in the form of nodes. For indexing-heavy operations it often boils down to needing more IOPs to write the data to disk more than anything, so more nodes == more IOPs.

Ditto to the LDAP question. Since you know you'll be adding a fair amount more load in the future, I'd probably invest some time up front to see what kind of throughput your 3-node cluster can handle so you can extrapolate into the future. ES scales fairly linearly with indexing, so if your three nodes can handle X today, figuring out Y in the future is pretty straightforward with node count.
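
To put rough numbers on it using only the figures from this thread:

```python
# Back-of-the-envelope only, using the figures from this thread:
# ~15M LDAP events/day vs. current indexing peaks of ~450 docs/s.
ldap_events_per_day = 15_000_000
avg_rate = ldap_events_per_day / 86_400   # seconds in a day
print(f"average LDAP rate: {avg_rate:.0f} events/s")  # ~174 events/s
# Peaks will be much higher (auth traffic clusters around business hours),
# so benchmark against a measured per-node ceiling, not this average.
```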

