Newbie performance troubleshooting, high load spikes on ES nodes

Hi polyfractal, thank you for your reply,

I've moved the Elasticsearch data volume to local storage on the hosts holding the VMs. That means we can't migrate the VMs around as easily, but with enough ES nodes we can afford to take one down to move it if needed.

Two nodes have now gone 18 hours without a load spike. One still squawked in the middle of the night, but its load average was 2, not 350, and I can live with that. Kibana still breaks periodically with a "bad gateway" error that clears when I reload the page.

I'm working on merging indices, probably boiling each month together. What's a generally prudent maximum for doc count and primary size for a single index, assuming a three node cluster and one shard?

Or should we add nodes and/or up the shard count closer to the default? Now would be the time to do that as I reindex/merge.
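To make the monthly merge concrete, here is a rough sketch of what I have in mind. It only builds the request bodies and sends nothing to the cluster. It assumes daily indices named like "prefix-YYYY.MM.DD" (the logstash-style names below are made up for illustration) and a version of Elasticsearch that has the _reindex API; on older versions the same grouping would have to feed a scan-and-bulk script instead. The function name is my own invention.

```python
from collections import defaultdict
import json


def monthly_reindex_bodies(daily_indices):
    """Group daily index names by month and build one _reindex body per month.

    Assumes names of the form 'prefix-YYYY.MM.DD' (hypothetical convention).
    Returns a dict mapping the monthly target index to its request body.
    """
    by_month = defaultdict(list)
    for name in daily_indices:
        # Split 'logstash-2015.01.30' into 'logstash' and '2015.01.30'.
        prefix, _, date = name.rpartition("-")
        year, month, _day = date.split(".")
        by_month[f"{prefix}-{year}.{month}"].append(name)

    # Each body would be POSTed to /_reindex; source.index takes a list.
    return {
        target: {"source": {"index": sorted(sources)}, "dest": {"index": target}}
        for target, sources in sorted(by_month.items())
    }


if __name__ == "__main__":
    # Hypothetical index names, for illustration only.
    example = ["logstash-2015.01.30", "logstash-2015.01.31", "logstash-2015.02.01"]
    print(json.dumps(monthly_reindex_bodies(example), indent=2))
```

The monthly target would need to be created first with whatever shard count we settle on, since that can't be changed after the index exists.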

Lastly (thank you for your patience as I veer off topic, I'm hoping to get as much help as I can while I have your attention), we're considering adding our LDAP logging to this stack, which will add up to ~15M events per day, about 4X our busiest days so far. We'll definitely add nodes. Should we also revisit the shard count? Again, this will likely be mostly write-only and queried only as needed for authentication troubleshooting and compliance audits.
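For a sense of scale, here is the back-of-the-envelope arithmetic behind that question. It is only a sketch: it assumes a 30-day month, counts the LDAP events alone at the full 15M per day, and takes "the default" to mean 5 primary shards. The function and variable names are mine.

```python
def docs_per_primary_shard(events_per_day, days, primary_shards):
    """Average document count per primary shard for one index covering `days` days."""
    return events_per_day * days / primary_shards


LDAP_EVENTS_PER_DAY = 15_000_000  # upper bound quoted above
DAYS_PER_MONTH = 30               # assumption: one index per 30-day month

# "About 4X our busiest days" implies a current peak of roughly this many events.
current_busiest_day = LDAP_EVENTS_PER_DAY / 4

if __name__ == "__main__":
    print(f"current busiest day: ~{current_busiest_day:,.0f} events")
    for shards in (1, 5):
        per_shard = docs_per_primary_shard(LDAP_EVENTS_PER_DAY, DAYS_PER_MONTH, shards)
        print(f"{shards} primary shard(s): ~{per_shard:,.0f} docs per shard per month")
```

So a monthly LDAP index would hold around 450M documents in a single shard, or around 90M per shard with five.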

Hope to hear from you,

Randy