I'm running a 3-node RELK cluster that's mostly storing application logs coming in via a syslog listener (logstash shipper). We have ~1800 indices, ranging from a few thousand docs and trivial size up to a handful at 1-2GB primary size and 3-4M docs on days when something was puking errors.
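For reference, here's roughly how I've been sizing up that index spread (assuming ES answers on localhost:9200; the column and sort params are from the _cat/indices docs):

```shell
# Largest indices first: name, doc count, primary store size
curl -s 'localhost:9200/_cat/indices?v&h=index,docs.count,pri.store.size&s=pri.store.size:desc' | head -20

# Rough total index count
curl -s 'localhost:9200/_cat/indices?h=index' | wc -l
```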
We're set for one replica and one shard per index, which was consistent with best practices when this cluster was built, since it's ostensibly mostly an archive; but I suspect our devs are querying the data more than we planned on. I recently installed x-pack monitoring and it shows search rates around ~25/s with occasional spikes approaching 200/s, and indexing rates no higher than ~450/s.
The nodes are all linux VMs with a separate VHD for ES data, all served from an NFS datastore. I know that's not best practice but also that our application is tiny compared to the scale many shops run ELK at. Every metric I can find on the VMs shows no significant waits for disk IO.
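In case it matters, this is the sort of thing I've been watching to rule out the disks (sysstat's iostat; /dev/sdb is where the data VHD lands on our VMs, adjust for yours):

```shell
# Extended device stats every 5 seconds; high await/%util during a load
# spike would implicate the NFS-backed data disk (sdb on our VMs)
iostat -x 5 /dev/sdb

# Run queue length and iowait at the same cadence
vmstat 5
```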
Here's the problem: our mostly idle nodes regularly spike to brief, extreme load averages, up to 375 for a minute or two and then back below 1. During these spikes, latency jumps into the seconds, Kibana times out, and our Nagios load-average alerts all go off. Sometimes a load spike coincides with a search spike, but often there's no search/index spike at the same time.
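Since the spikes pass before I can get a shell open, I've been thinking of leaving something like this running to catch one in the act (the threshold of 50 is arbitrary, and it assumes ES on localhost:9200):

```shell
# Snapshot ES hot threads whenever the 1-minute load average crosses 50,
# so we can see what the node is actually busy doing during a spike
while true; do
  load=$(cut -d' ' -f1 /proc/loadavg)
  if [ "${load%.*}" -ge 50 ]; then
    curl -s 'localhost:9200/_nodes/hot_threads?threads=10' \
      > "hot_threads_$(date +%s).txt"
  fi
  sleep 10
done
```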
I found a reference on this forum tying ES performance problems to OpenJDK, but we're running Oracle Java, so that shouldn't apply.
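On the JVM angle, I figure I can at least pull GC pause totals from the nodes stats API and see whether collections line up with the spikes (filter_path just trims the response down):

```shell
# Per-node garbage collection counts and total pause time; long old-gen
# collections coinciding with load spikes would point at GC pressure
curl -s 'localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.gc'
```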
I've found other references that generally say: if you're having ES performance problems, throw another node on the pile. I could also try increasing my shard count, merging old indices to reduce the index count, or moving the data disks to local storage on their hosts. But I'd like better proof of the actual problem before I resort to trial and error.
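If consolidation turns out to be the answer, my rough plan would be a _reindex of each month's dailies into one monthly index, something like the sketch below (index names are hypothetical; ours follow the usual logstash-YYYY.MM.DD pattern):

```shell
# Sketch: collapse a month of daily indices into a single monthly index,
# then the dailies could be deleted once the counts check out
curl -s -XPOST 'localhost:9200/_reindex' -H 'Content-Type: application/json' -d '{
  "source": { "index": "logstash-2017.06.*" },
  "dest":   { "index": "logstash-2017.06" }
}'
```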
I'd be grateful for any guidance on troubleshooting and metrics...
Randy in Seattle