We are seeing a strange issue on one of our clusters: every few hours, the read IOPS on some of the data nodes spike, causing heavy iowait and degraded indexing performance. A spike can persist for several hours, but eventually it subsides and performance returns to normal.
iotop clearly shows the data node process to be the culprit, but I have no idea what it is doing to generate such a high read load.
I should mention that all other indicators (CPU utilization, GC, heap usage, etc.) are within normal parameters, so it is most certainly the high iowait that is killing performance.
Also, there are no queries running during these spikes, so it is definitely something internal.
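In case it helps, this is the kind of snapshot I plan to capture the next time a spike hits - a minimal Python sketch against the REST API, where the host and node name are placeholders:

```python
# Minimal sketch: capture hot_threads from the node that is showing the iowait.
# The host and node name are placeholders, not our real values.
import requests

HOST = "http://localhost:9200"
NODE = "data-node-4"  # hypothetical node name

# hot_threads returns plain text, one stack sample per hot thread
resp = requests.get("{}/_nodes/{}/hot_threads".format(HOST, NODE), timeout=30)
resp.raise_for_status()
print(resp.text)
```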
Our setup is 5 physical machines (each with 64GB of memory), all of them running a single data node (with a 30GB heap); in addition, machines 1-3 each run a master node (with a 4GB heap) and machines 4-5 each run an ingest node (with a 12GB heap).
We are running ES 5.4 on SLES 11 SP4 with JVM 1.8.0_92.
Swap is disabled and we are not overcommitting memory - each machine has plenty of FS cache to work with.
Our indices have 5 primary shards with a single replica.
Our most active index is a monthly index that currently contains over 700M documents, and each replica is about 300GB in size.
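For completeness, the node roles and heap sizes above can be confirmed with the _cat/nodes API; here is a minimal sketch (the host is a placeholder):

```python
# Minimal sketch: list node roles and max heap to confirm the layout described above.
# The host is a placeholder.
import requests

HOST = "http://localhost:9200"

resp = requests.get(
    "{}/_cat/nodes".format(HOST),
    params={"v": "true", "h": "name,node.role,heap.max"},
    timeout=30,
)
resp.raise_for_status()
print(resp.text)  # one line per node: name, role letters (m/d/i), max heap
```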
From what I've read so far, there could be three things causing massive read operations:
1. Index refreshes
2. ID lookups (all of our document IDs are custom, not auto-generated)
3. Merges
Our indices have a refresh interval of 30s, so I assume #1 is not really an issue.
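Just to rule out a misconfiguration on that front, here is how I would double-check the applied refresh interval (a sketch with a placeholder host and index name):

```python
# Minimal sketch: confirm the refresh_interval actually applied to the index.
# Host and index name are placeholders for our monthly index.
import requests

HOST = "http://localhost:9200"
INDEX = "events-2017.06"  # hypothetical index name

resp = requests.get("{}/{}/_settings".format(HOST, INDEX), timeout=30)
resp.raise_for_status()
index_settings = resp.json()[INDEX]["settings"]["index"]

# refresh_interval only appears here because we set it explicitly;
# if it were absent, the index would be on the 1s default.
print(index_settings.get("refresh_interval", "not set (1s default)"))
```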
ID lookups (#2) could potentially explain it, but I would expect them to cause high disk load 100% of the time, not periodically.
Merges (#3) are the only other thing that could explain this behavior, except that _nodes/stats doesn't show any merge operations executing on the overloaded nodes, and there is no indexing throttling.
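For reference, this is the kind of check I mean when I say _nodes/stats shows no merges - a minimal sketch, with the host as a placeholder:

```python
# Minimal sketch: pull current merge activity and indexing throttle time
# for every node from the node stats API. Host is a placeholder.
import requests

HOST = "http://localhost:9200"

resp = requests.get("{}/_nodes/stats/indices".format(HOST), timeout=30)
resp.raise_for_status()

for node_id, node in resp.json()["nodes"].items():
    merges = node["indices"]["merges"]
    indexing = node["indices"]["indexing"]
    print(
        node["name"],
        "current merges:", merges["current"],
        "total merge time (ms):", merges["total_time_in_millis"],
        "indexing throttle (ms):", indexing["throttle_time_in_millis"],
    )
```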
I have been struggling with this for several days with no success. I'd appreciate it if anyone could shed some light on what could be causing these periodic spikes in disk reads.
Thanks,
Dan