The Elastic Agent queue depth keeps spiking periodically. After a bit more digging, my logs showed that the spikes line up with ILM rolling over indices and downsampling my data.
I'm guessing that a large part of the problem is that I'm running a single ES node. Long story short, we need to trim resource usage as much as possible if we're going to keep using ES, so adding more ES nodes is not an ideal solution.
My thought would be to stagger the ILM jobs somehow so they run a few at a time throughout the day instead of all at once. Is there a way to do that?
My other (not ideal) thought would be to add extra processing nodes, while keeping only one master/data node, but would ILM even be able to run on a non-data node?
Hmm, ILM-triggered activity should stay out of the way of your production workload, so it sounds like we might need a bit more throttling on the downsampling action. That said, downsampling is only supposed to use a small threadpool, 1/8th of your CPUs, so I wonder why it's having such a big impact.
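One way to check whether that work is actually piling up is to look at the per-node thread pool stats during a spike. Something like the following should surface any pool with a deep queue (the exact pool name used for downsampling can vary by version, so just look for whichever pool shows sustained queueing or rejections):

```
GET _cat/thread_pool?v&h=node_name,name,active,queue,rejected&s=queue:desc
```

A pool with a persistently non-zero queue or a growing rejected count while the Agent queue depth spikes would be a strong hint about where the contention is.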
Could you grab GET _nodes/hot_threads?threads=9999 from a time when it's struggling, and share it here (or more likely on https://gist.github.com/, since it'll be too big to post inline)?
Data is on an iSCSI LUN backed by SSDs, and I believe they're pretty fast SSDs as well. Have you ever heard of Kaminario? That's what the storage is on.
Also, possibly relevant: ES is running as a single-node Docker stack service. We have 3 Docker Swarm nodes it can run on, so each of those nodes mounts the LUN, and the filesystem is OCFS. The idea is that if the one ES instance has to be restarted on another node, it will pick up the same data as the old instance.
Hmm. These stack dumps show that your system is heavily bottlenecked on IO: many threads are stuck for hundreds of milliseconds waiting for a write() or similar to complete. I don't think your storage is performing as well as you think it should.
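If you want to corroborate that from the Elasticsearch side, the node filesystem stats are worth a look; on Linux they include device-level IO counters, so two snapshots taken a minute apart during a spike will show how much IO the data path is actually sustaining (the exact fields reported depend on your version and platform, so check against your own output):

```
GET _nodes/stats/fs
```

Comparing the operation and timing counters between snapshots gives a rough per-device throughput and latency picture, independent of whatever the storage array claims.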
Ah yes, that would explain it indeed; thanks for closing the loop. Clustered filesystems seem to be a rich source of performance (and sometimes correctness) issues, and the complexity they add is largely unnecessary given that Elasticsearch already does its own clustering and replication. XFS is a better choice IMO.