High read I/O and Load Average after upgrading to Elasticsearch 5.3.2

We upgraded Elasticsearch from 1.7.4 to 5.3.2 after a long migration process.
A few hours after the upgrade, we started to see performance issues in our app.
We noticed that 1 or 2 nodes (out of 12 data nodes) suffer from high read I/O and a high load average (caused by waiting on disk reads). Restarting the affected node (we have replicas) temporarily solves the issue on that node, but the problem just moves to another node.
We investigated merges and checked whether it might be caused by a hot spot (that's what we found online as a possible cause for issues like ours), but we didn't find anything that sheds light on this.
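For reference, these are the kinds of checks we ran, sketched as curl calls against the cat APIs (assuming a node reachable at localhost:9200; adjust host and port for your cluster):

```shell
# Compare shard counts and disk usage per node to rule out a hot spot:
curl -s 'localhost:9200/_cat/allocation?v&h=node,shards,disk.indices,disk.used'

# Largest shards and where they live (sorted by store size, top 20):
curl -s 'localhost:9200/_cat/shards?v&s=store:desc' | head -n 20

# Per-node merge statistics (total time, current merges):
curl -s 'localhost:9200/_nodes/stats/indices/merge?pretty'
```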

We are running 12 c3.4xlarge data instances on AWS, using instance store as disk (~80GB used out of 320GB on each node), with 30GB RAM (15GB for the Elasticsearch heap) and 16 cores.
In addition, we are running 3 m4.large master nodes.
We have 1 replica for each shard and 45 indices; the total size including replicas is 800GB.

IO Top (screenshot):

Load Average, last 12 hours (screenshot):



I'd start by exploring the nodes stats and looking for significant differences, e.g. in merge times, between affected and unaffected nodes. You can also issue several calls to the nodes hot threads API to capture thread dumps. This should reveal what is actually happening on the affected nodes.
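The suggestion above could look something like this (a sketch assuming the cluster is reachable at localhost:9200 and the affected node is named "data-node-1", a hypothetical name; substitute your own node name):

```shell
# Node stats, restricted to merge metrics, for comparison across nodes:
curl -s 'localhost:9200/_nodes/stats/indices/merge?pretty'

# Capture several hot-threads samples from the affected node, spaced out
# so transient activity can be distinguished from a persistent culprit:
for i in 1 2 3 4 5; do
  curl -s 'localhost:9200/_nodes/data-node-1/hot_threads?threads=5' >> hot_threads.txt
  sleep 10
done
```

Comparing a few hot-threads dumps side by side usually makes it obvious whether the busy threads are merge, search, or I/O related.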


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.