We are running Elasticsearch 7.12.0 and have recently seen system load and read operations suddenly increase significantly on one data node, even though the number of query and index operations being executed has not changed compared to previous hours or days. There is no corresponding increase in CPU usage.
The screenshot shows the normal search and index operations on the cluster.
Under normal circumstances, the system load on our data nodes hovers between 2 and 3. When the load rises above 10, indexing slows to a crawl, i.e. to < 10/s. The system load stays high for a few hours. When it eventually drops, the indexing rate returns to normal levels.
Does anyone have any ideas as to what can cause this sudden increase in load and read operations?
What tools are available to us to:
- Understand what exactly is causing the increase in read operations and load
- Take action to return the load to normal levels as quickly as possible without compromising the data on the node
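
To make the comparison between nodes concrete, here is a minimal sketch of how the read operations and load could be sampled from the node stats API (`GET _nodes/stats/fs,os,indices`). It is only an illustration: the URL `http://localhost:9200` and the helper names `fetch_node_stats` and `read_op_deltas` are assumptions, and the `fs.io_stats` section is only reported on Linux.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: adjust to your cluster


def fetch_node_stats(base_url=ES_URL):
    """Fetch filesystem, OS and index stats for every node."""
    url = f"{base_url}/_nodes/stats/fs,os,indices"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def read_op_deltas(before, after):
    """Compare two node stats samples.

    Returns (node name, read ops between samples, 1m load, current merges)
    tuples, sorted so the node with the most reads comes first.
    """
    rows = []
    for node_id, node in after["nodes"].items():
        prev = before["nodes"].get(node_id)
        if prev is None:
            continue  # node joined between the two samples
        try:
            now_reads = node["fs"]["io_stats"]["total"]["read_operations"]
            prev_reads = prev["fs"]["io_stats"]["total"]["read_operations"]
        except KeyError:
            continue  # io_stats is not reported on this platform
        load_1m = node.get("os", {}).get("cpu", {}).get("load_average", {}).get("1m")
        merges = node.get("indices", {}).get("merges", {}).get("current")
        rows.append((node.get("name", node_id), now_reads - prev_reads, load_1m, merges))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows
```

Calling `fetch_node_stats()` twice, for example a minute apart, and passing both results to `read_op_deltas` should show whether the extra reads are confined to the one node and whether they coincide with running merges. The hot threads API (`GET _nodes/hot_threads`) can then be run against that node to see what its threads are doing at that moment.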