Performance problem because of read IOPS increase

Hi, I have an Elasticsearch cluster with 3 nodes (version 6.8.0). It's deployed on 3 virtual machines with 16 GB RAM and 4 cores each, in Google Cloud.
Usually we have far more write IOPS than read IOPS (40-50 vs 0-5) without any problem.
On two occasions read IOPS have spiked, causing I/O wait problems and hurting performance to the point that we had to shut down the affected server. Both times this happened during off-peak hours, when the query volume was low.
Is there any way to know what causes these read IOPS peaks?
Thanks in advance.
Javier

A few things here:

  1. 6.8.0 is really really old and you should look at upgrading.
  2. What type of backend storage are you using on GCP? 40-50 IOPS isn't a lot.
  3. The write/read IOPS ratio doesn't mean much on its own; writes will always be high if you're indexing, and if you have enough RAM the filesystem cache can absorb most of your reads (or you may simply not be searching all that much).
  4. If the IOPS spikes are happening at non-load times, it is possible that Elasticsearch is doing background merging or some other background task (see the sketch below for one way to check).
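
If you want to catch it in the act, the hot threads API is the quickest way to see what a node is busy with while the spike is happening. A minimal sketch, assuming the cluster is reachable at http://localhost:9200 with no authentication (a hypothetical setup; adjust the URL and auth for your cluster):

```python
# Sketch: dump hot threads for every node so you can see what Elasticsearch
# is busy with while the read IOPS spike is happening.
# Assumes the cluster is reachable at http://localhost:9200 with no auth
# (hypothetical setup -- adjust URL / auth for your cluster).
import requests

ES_URL = "http://localhost:9200"

# The hot threads API returns plain text, one section per node; background
# merging typically shows up as Lucene merge threads near the top.
resp = requests.get(f"{ES_URL}/_nodes/hot_threads", params={"threads": 5})
resp.raise_for_status()
print(resp.text)
```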

Unfortunately we have dependencies that, at the moment, do not allow us to upgrade the version.
As for the disks, we were using pd-standard, but we have upgraded to pd-balanced disks, precisely to avoid this problem.
40-50 IOPS isn't much, and that's our usual operation, but when this problem occurs we see peaks of around 390 IOPS (358 read + 34 write).
What kind of background task could Elasticsearch be doing to suddenly jump to 358 read operations per second?
Is there any way I can check what operation Elasticsearch was performing?

(Disclaimer: I'm not familiar with GCP storage, so I'm just going off the docs in this area.)

we were using pd-standard

That appears to be HDD-backed disks, which would definitely have low IOPS and are probably not suitable for content/hot nodes in Elasticsearch.

390 IOPS (358 read + 34 write)

Doesn't seem like it should be a problem for pd-balanced.

This IOPS pattern seems like it could be background merging.

Do you have Stack Monitoring set up? Can you check the number of segments on your cluster around that time? Do they start going down?
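
If Stack Monitoring isn't set up, you can approximate it by polling the node stats yourself during a spike and watching segment counts and merge activity. A rough sketch, again assuming http://localhost:9200 with no auth (hypothetical; adjust for your cluster):

```python
# Sketch: poll per-node segment counts and merge stats so you can see whether
# segment counts drop (a sign of background merging) while read IOPS spike.
# Assumes the cluster is reachable at http://localhost:9200 with no auth
# (hypothetical setup -- adjust URL / auth for your cluster).
import time
import requests

ES_URL = "http://localhost:9200"

while True:
    stats = requests.get(f"{ES_URL}/_nodes/stats/indices/segments,merges").json()
    for node_id, node in stats["nodes"].items():
        segments = node["indices"]["segments"]["count"]
        active_merges = node["indices"]["merges"]["current"]
        merged_bytes = node["indices"]["merges"]["total_size_in_bytes"]
        print(f"{node['name']}: segments={segments} "
              f"active_merges={active_merges} total_merged_bytes={merged_bytes}")
    print("---")
    time.sleep(30)
```

If the segment count falls while the read IOPS are high, merging is the likely culprit.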

There might be some other causes here, but 6.8 is at the point where I don't recall much about it anymore.

Thanks again for your answer. With the new disks this shouldn't happen again, but we wanted to track down the source of the problem. I'll take a look at your recommendations.
