We upgraded our Elasticsearch to 5.3.2 from 1.7.4 after a long process of migration.
A few hours after the upgrade, we started to have performance issues in our app.
We saw that 1 or 2 nodes (out of 12 data nodes) suffer from high read I/O and a high load average (caused by waiting on disk reads). Restarting the affected node (we have replicas) temporarily solves the issue on that node, but the problem just moves to another node.
We investigated merges and checked whether the issue might be caused by a hot spot (the most common explanation we found online for this kind of behavior), but we didn't find anything that shed light on it.
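For reference, these are the kinds of checks that can surface merge pressure or a busy node (a sketch, assuming the cluster is reachable on localhost:9200; both endpoints exist in Elasticsearch 5.x):

```shell
# Show what the busiest threads on each node are doing
# (e.g. Lucene merges, search, bulk indexing) - a node stuck in
# merge threads points at merge-driven I/O
curl -s 'localhost:9200/_nodes/hot_threads?threads=5'

# Per-node index stats, including the "merges" section
# (current merges, total merge time, throttled time)
curl -s 'localhost:9200/_nodes/stats/indices?pretty'
```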
We are running 12 c3.4xlarge data instances on AWS, using the instance store as disk (~80GB used out of 320GB on each node), with 30GB RAM (15GB allocated to the Elasticsearch heap) and 16 cores.
In addition, we are running 3 m4.large master nodes.
We have 1 replica for each shard and 45 indices; the total size including replicas is 800GB.
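To rule out a hot spot at the shard level, the cat APIs can show whether shards (and disk usage) are evenly spread across the data nodes (again assuming localhost:9200; the `s` sort parameter is available from 5.1):

```shell
# List shards with their node and on-disk size, largest first,
# to spot an unevenly placed large shard
curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,node,store&s=store:desc'

# Shard count and disk usage per node - a node with far more
# shards or data than its peers is a likely hot spot
curl -s 'localhost:9200/_cat/allocation?v'
```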
IO Top (screenshot):
Load Average over the last 12 hours (screenshot):