After @Cesar_Mejia's investigation, I agree with @StephenB's suggestion to move the Logstash-heavy loads to another host, at least temporarily.
Also, check the grok processing times; maybe those can be optimized.
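To see where grok time is going, Logstash's monitoring API (port 9600 by default) reports per-plugin timings. A rough sketch of what to look for; the JSON fragment below is made-up sample data, not real output from this cluster:

```shell
# On the Logstash host, while under load:
#   curl -s localhost:9600/_node/stats/pipelines?pretty
# In the filters section, look for grok entries whose duration_in_millis
# is large relative to the number of events processed.

# Canned example of the relevant fragment, to show the field to watch:
stats='"name" : "grok", "events" : { "in" : 120000, "duration_in_millis" : 480000 }'
echo "$stats" | grep -o '"duration_in_millis" : [0-9]*'
# prints: "duration_in_millis" : 480000
```

Here 480 s of filter time for 120 k events (4 ms/event) would point at an expensive or backtracking-heavy grok pattern.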
Agreed.
But the gaps in the monitoring graphs suggest to me that indexing is stalling completely under load: the storage likely doesn't have enough IOPS for the peak ingest load, not enough to keep up. That creates back pressure... and it takes a while to recover.
What to do about it? Get/use faster storage.
Edit: If using Linux, you can check IO performance using iostat -x <N>, say N=1 or 10, while under load. I suspect you will see high %util and await times on the device corresponding to the RAID volume.
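As a minimal sketch of reading that output (the device lines below are made-up sample data; real column order varies between sysstat versions, so check the header of your own iostat output first):

```shell
# iostat -x 10 prints one block per interval; per-device lines end in %util.
# Sample device lines (fabricated); here the last field is %util:
sample='sda 0.50 12.00 4.00 210.00 96.00 10240.00 96.6 9.80 45.20 2.10 98.70
nvme0n1 1.00 5.00 800.00 300.00 6400.00 4800.00 20.4 0.40 0.35 0.10 12.30'

# Flag devices whose %util (last field) exceeds 90 -- a sign of saturation:
echo "$sample" | awk '$NF > 90 { print $1, "saturated:", $NF "%util" }'
# prints: sda saturated: 98.70%util
```

A device pinned near 100 %util with high await is the classic signature of storage that can't keep up with the ingest rate.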
One way would be to let Dell sell you some at-least-slightly-larger SSDs that just slide into the server, let the RAID rebuild (unless it is RAID0), and rinse and repeat until all disks are replaced. There are 101 other ways too.

Also check where the significant IOPS are actually being performed; it's relatively easy to have something writing to the OS volume (which might also be slow HDDs) by mistake, especially when the same server is hosting several services. And decoupling Logstash from Elasticsearch is just a better design, IMO. I'm old school, but 3x Elasticsearch instances (which are competing with each other!?) AND Logstash on the same physical server doesn't float my boat either.
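As a sketch of that "check where the IOPS are going" step, assuming Linux with sysstat installed (the /tmp path below is just a stand-in for the real data directory):

```shell
# Hypothetical path -- substitute your actual Elasticsearch/Logstash data dir.
data_dir=/tmp   # e.g. /var/lib/elasticsearch on a real node

# Which block device actually backs this directory? If it resolves to the
# same (slow) volume as the OS, the services are competing for the same IOPS.
device=$(df -P "$data_dir" | awk 'NR==2 {print $1}')
echo "backing device: $device"

# Per-process disk activity (sysstat's pidstat); run while under load to see
# which service is generating the writes:
#   pidstat -d 5
```

Cross-referencing the backing device against the busy devices in iostat quickly shows whether the "wrong" volume is taking the write load.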
Speaking of which... has anyone tested ELK performance on NVMe disks? Is it significantly better, and is it reliable over the long term?
@rios is NVMe significantly better than what?
Elastic Cloud runs tens of thousands of clusters on local NVMe, and the IO throughput is significantly better than remote SSD, spinning disk, or most other storage.
Of course it all comes down to specific use cases, but yes, in general NVMe gives better performance than most options...
Better performance, but it may not be better cost-to-performance for your specific use case if the high IO is not required.
Yes, compared with SATA SSDs or older disk technologies, NVMe is relatively new, and yes, it is very fast.
When SSDs were first introduced, for a few years they were aimed more at laptops/PCs than at servers. Nowadays SSDs are so reliable that nobody has to ask or think about whether they will hold up running 24/7/365.
Thank you for sharing valuable information.
All solid-state storage has a finite lifetime, as underneath there are just NAND cells that can be erased/written only a finite number of times. The NVMe protocol, typically carried over PCIe, is simply faster: higher throughput, lower latency, and higher parallelism than the traditional SATA/SAS/IDE/SCSI interfaces/protocols, mostly because those were designed with spinning disks in mind, whereas NVMe was designed specifically for solid-state devices.
But pulling it back to @Cesar_Mejia's issue: maybe he's had a chance to capture some iostat diagnostics? If changing the storage is not an option, and if I am right that slow HDD storage is his main problem, then what else could he do?
@Cesar_Mejia You mentioned 3 Elasticsearch instances? What was the motivation there? Are these instances virtual machines or containers, or just separate elasticsearch.yml configurations so that they co-exist on the same machine? Are the data directories all on the same RAID volume? My fear is that you essentially have relatively slow storage with multiple things competing for the same IOPS, making a non-ideal situation worse.
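For reference, if the three instances are just separate elasticsearch.yml configurations, the path.data settings are the thing to compare. A hypothetical sketch (paths and ports invented for illustration, not taken from this cluster):

```yaml
# Instance 1 (es-node-1/elasticsearch.yml) -- hypothetical example:
node.name: node-1
http.port: 9200
transport.port: 9300
path.data: /data/raid/es-node-1   # if node-2 and node-3 also point under
                                  # /data/raid, all three contend for the
                                  # same volume's IOPS

# A decoupled alternative would give each instance its own physical volume,
# e.g. path.data: /data/nvme1/es-node-1
```

If every instance's path.data resolves to the same RAID device, that would confirm the competing-for-IOPS theory above.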