Is there a way to throttle or stagger ILM?

A system I'm building that uses ES heavily has periodically not been getting the data it should out of ES. After digging in a bit, I finally noticed a pattern.

The Elastic Agent queue depth keeps spiking periodically. A bit more digging through my logs showed that those spikes happen when ILM is rolling over indices and downsampling my data.

I'm guessing that a large part of the problem is that I'm running a single ES node. Long story short, we need to keep our footprint as small as possible if we're going to keep using ES, so adding more ES nodes is not an ideal solution.

My thought would be to stagger the ILM jobs somehow, so it's doing a few at a time throughout the day instead of all of them at once. Is there a way to do that?

My other (not ideal) thought would be to add extra processing nodes, while keeping only one master/data node, but would ILM even be able to run on a non-data node?

Any other ideas?

Thanks!

Hmm, ILM-triggered activities should be trying to stay out of the way of your production workload, so it sounds like we might need a bit more throttling on the downsampling action. That said, it's only supposed to use a tiny threadpool (1/8th of your CPUs), so I wonder why it's having such a big impact.
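If you want to see which pool that work is landing on and whether it's backing up, the cat thread pool API lists every pool with its active and queued task counts. A minimal sketch (pick whatever columns you like):

```
GET _cat/thread_pool?v=true&h=node_name,name,active,queue,rejected,completed
```

A persistently non-zero queue on whichever pool the downsampling runs on would confirm it's the one under pressure.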

Could you grab GET _nodes/hot_threads?threads=9999 from a time when it's struggling, and share it here (or likely on https://gist.github.com/ since it'll be too big)?

@DavidTurner Here are a few different runs of that command.

Thanks, that's helpful. Are you running on spinning disks or SSDs?

Data is on an iSCSI LUN backed by SSDs, and I believe they're pretty fast SSDs as well. Have you ever heard of Kaminario? That's what the storage is on.

Edit:

Also, possibly relevant: ES is running as a single-node Docker stack service. We have 3 Docker Swarm nodes it can run on, so each of those nodes mounts the LUN, and we have OCFS configured as the filesystem. The idea being that if the one instance of ES has to be restarted on another node, it will be using the same data as the old instance.

Hmm. These stack dumps show that your system is heavily bottlenecked on IO, with many threads stuck for several hundred milliseconds waiting for a write() or similar to complete. I don't think your storage is performing as well as you think.
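If you want a second signal from inside Elasticsearch itself, the node filesystem stats expose cumulative IO counters that you can sample during one of the spikes. This is just a sketch of where to look; the io_stats section is only populated on Linux:

```
# Per-node filesystem stats; on Linux, fs.io_stats reports cumulative
# read/write operations (and IO time in recent versions) per data-path device.
GET _nodes/stats/fs
```

Taking two samples a minute apart gives a rough IOPS and IO-time figure to line up against what the storage is supposed to deliver.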

Well, I am 90% sure the issue is OCFS2. I moved the ES instance to a different server where I could use a normal XFS iSCSI LUN, and that seems to have resolved the issues I was having.

Thanks for the help @DavidTurner !

Ah yes, that would explain it indeed. Thanks for closing the loop. Clustered filesystems seem to be a rich source of performance (and sometimes correctness) issues, and the complexity they add is largely unnecessary when Elasticsearch is already doing its own clustering and replication work. XFS is a better choice IMO.

Closing another loop on the ES side, we still think it might be a good idea to limit the resources needed by downsampling anyway:

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.