Periodic spikes in disk read IO leading to degraded performance

We are seeing a strange issue on one of our clusters: every few hours the read IOPS on some of the data nodes spike, causing a lot of iowait and degrading indexing performance. This can persist for several hours, but eventually the load drops and performance returns to normal.
iotop clearly shows the culprit to be the data node, but I have no idea what it is doing that is causing such a high load.
I should mention that all other indicators (CPU utilization, GC, heap size, etc.) are within normal parameters, so it is most certainly the high iowait that is killing performance.
Also, there are no queries running during these spikes so it is definitely something internal.
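
For anyone wanting to reproduce the checks, here is a minimal sketch (assuming the affected node's HTTP endpoint is reachable on localhost:9200 and the Python requests library is available) of watching hot threads and the filesystem I/O counters from the REST API while a spike is happening:

```python
# Minimal sketch: poll hot threads and fs I/O counters on the local node
# during a spike. Assumes the node's HTTP endpoint is at localhost:9200.
import time

import requests

ES = "http://localhost:9200"  # assumption: one of the affected data nodes

for _ in range(10):
    # Hot threads shows what the node is actually busy doing
    # (merge threads, refresh threads, search threads, ...).
    hot = requests.get(f"{ES}/_nodes/_local/hot_threads", params={"threads": 5}).text
    print(hot)

    # fs io_stats (Linux only) exposes cumulative read/write operation counters,
    # so the delta between two samples shows how fast the reads are piling up.
    stats = requests.get(f"{ES}/_nodes/_local/stats/fs").json()
    for node in stats["nodes"].values():
        io = node.get("fs", {}).get("io_stats", {}).get("total", {})
        print(node["name"],
              "reads:", io.get("read_operations"),
              "writes:", io.get("write_operations"))

    time.sleep(30)
```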

Our setup is 5 physical machines (each with 64GB of memory), all running a single data node (30GB heap); in addition, machines 1-3 each run a master node (4GB heap) and machines 4-5 each run an ingest node (12GB heap).
We are running ES 5.4 on SLES 11 SP4, JVM 1.8.0_92.
Swap is disabled and we are not overcommitting memory - each machine has plenty of FS cache to work with.

Our indices have 5 primary shards with a single replica.
Our most active index is a monthly index which currently contains over 700M documents and each replica is about 300GB in size.
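
For context, per-shard document counts and on-disk sizes can be confirmed with the cat APIs - a minimal sketch, with "myindex-2017.06" standing in as a placeholder for the actual monthly index name:

```python
# Minimal sketch: list per-shard doc counts and on-disk sizes for one index.
# "myindex-2017.06" is a placeholder index name.
import requests

ES = "http://localhost:9200"  # assumption: any node in the cluster

resp = requests.get(
    f"{ES}/_cat/shards/myindex-2017.06",
    params={"v": "true", "h": "index,shard,prirep,docs,store,node"},
)
print(resp.text)
```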

From what I've read so far there could be three things causing massive read operations:

  1. Index refreshes
  2. ID Lookups (all of our IDs are custom)
  3. Merges

Our indices have a refresh interval of 30s, so I assume #1 is not really an issue.
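
(A quick way to double-check the effective refresh interval - a minimal sketch assuming localhost:9200; indices without an explicit value fall back to the 1s default:)

```python
# Minimal sketch: print the refresh_interval explicitly set on each index.
# Indices that return nothing here are using the 1s default.
import requests

ES = "http://localhost:9200"  # assumption: any node in the cluster

settings = requests.get(f"{ES}/_all/_settings/index.refresh_interval").json()
for index, body in settings.items():
    value = body.get("settings", {}).get("index", {}).get("refresh_interval", "1s (default)")
    print(index, value)
```
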
ID lookups could potentially explain it - since all of our IDs are custom, every indexed document has to be checked against the existing IDs - but I would expect that to cause high disk load 100% of the time, not periodically.
Merges are the only other thing that could explain this behavior, except that running _nodes/stats doesn't show any merge operations executing on the overloaded nodes and there is no indexing throttling.
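
For completeness, this is roughly how the merge and throttling counters can be sampled from _nodes/stats - a sketch assuming localhost:9200. If "current" stays at 0 and the total merge time barely moves between samples, merging is probably not what is reading from disk:

```python
# Minimal sketch: compare merge/throttle counters across a 60s window.
import time

import requests

ES = "http://localhost:9200"  # assumption: any node in the cluster


def snapshot():
    stats = requests.get(f"{ES}/_nodes/stats/indices").json()
    out = {}
    for node in stats["nodes"].values():
        merges = node["indices"]["merges"]
        indexing = node["indices"]["indexing"]
        out[node["name"]] = (
            merges["current"],                           # merges running right now
            merges["total_time_in_millis"],              # cumulative merge time
            indexing.get("throttle_time_in_millis", 0),  # cumulative indexing throttle
        )
    return out


before = snapshot()
time.sleep(60)
after = snapshot()

for name, (current, total_ms, throttle_ms) in after.items():
    delta_ms = total_ms - before[name][1]
    print(f"{name}: merges running={current}, merge time in last 60s={delta_ms}ms, "
          f"total indexing throttle={throttle_ms}ms")
```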

I have been struggling with this for several days with no success - I'd appreciate it if anyone could shed some light on what could cause these periodic spikes in disk reads.

Thanks,
Dan

I have the same problem in my cluster on ES 5.4, but I had no such problem before I migrated to 5.4 (previously I used v1.7.2).

The only difference is that in v1.7.2 I changed some merge policy parameters, but in v5.4 those parameters don't seem to have the same effect.
Did you change (tune) any merge policy parameters?

No, I haven't changed any merge parameters, but I believe many of the merge configuration options that existed in v1/v2 were removed in v5.
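
If you want to see whether any merge-related settings are still explicitly set on an index in 5.x, something like this works (a sketch; "my-index" is a placeholder name, and an empty result just means the defaults apply):

```python
# Minimal sketch: show any explicitly-set index.merge.* settings on one index.
# "my-index" is a placeholder name.
import requests

ES = "http://localhost:9200"  # assumption: any node in the cluster

settings = requests.get(f"{ES}/my-index/_settings/index.merge.*").json()
print(settings.get("my-index", {}).get("settings", {}))
```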

Anyway, the high read spikes stopped entirely when we started writing the data of our most active index to daily indices - we have the same indexing volume, except that it now goes to a daily index with 40M docs / 5 shards instead of a monthly index with 700M docs / 5 shards.
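
In case it helps anyone doing the same switch, here is a sketch of a 5.x index template so that each new daily index picks up consistent settings automatically (the template name "daily_logs" and the "logs-*" pattern are placeholders, not our actual names):

```python
# Minimal sketch: register a 5.x index template for daily indices.
# The template name and index pattern are placeholders.
import requests

ES = "http://localhost:9200"  # assumption: any node in the cluster

template = {
    "template": "logs-*",  # 5.x uses "template"; 6.x+ renamed it to "index_patterns"
    "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 1,
        "refresh_interval": "30s",
    },
}

resp = requests.put(f"{ES}/_template/daily_logs", json=template)
print(resp.status_code, resp.json())
```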

However, I still don't understand where the bottleneck was coming from. Was it because there were only five shards? Would it make any difference if I increased the number of shards, given that we only have five physical machines?

I'd appreciate it if someone from the Elastic team could suggest what it was about the index size that would cause intensive periodic disk reads. :slight_smile:

I am facing a similar issue. What did you do finally?

This:
