1 Node gets stuck with high load and 0% disk idle

I'm running a 3-node cluster of ES 6.5.4 in AWS on c5.xlarge nodes, each with a 700 GB drive and 3000 provisioned IOPS. Our cluster holds almost 1 billion documents and can handle a lot of search and indexing most of the time. Sometimes it gets into a very weird state, and I haven't been able to figure out what is going on. When this state occurs, we see a lot of search latency in our application and in the ES cluster. Load is high on a single node, but there is no obvious sign of what it is doing. Indexing and search rates aren't high on that node, and neither is CPU. I've searched the tasks and hot threads output and nothing obvious is there, though there are often more tasks on the bad node than on the others (I'm attaching the task list). Hot threads shows no high CPU. The most telling sign is that in the AWS EC2 control panel, I can see that the bad node is at 0% disk idle, while the disks of the other 2 nodes are at 80-90% idle. https://pastebin.com/k6X00NQm https://imgur.com/a/xTBfKhE
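One way to check the disk picture from the Elasticsearch side, rather than only from the EC2 graphs, is to take two snapshots of `GET _nodes/stats/fs` a few seconds apart and compare the per-node I/O counters. Below is a minimal sketch, assuming the nodes run on Linux (where the `fs.io_stats` section is reported) and that the two responses have already been parsed from JSON. The function name `io_rates` and the sample numbers are made up for illustration.

```python
def io_rates(before, after, interval_s):
    """Per-node read/write operations per second between two
    `_nodes/stats/fs` responses taken `interval_s` seconds apart."""
    rates = {}
    for node_id, node in after.get("nodes", {}).items():
        prev = before.get("nodes", {}).get(node_id)
        if prev is None:
            continue  # node was not present in the first snapshot
        cur_total = node.get("fs", {}).get("io_stats", {}).get("total")
        prev_total = prev.get("fs", {}).get("io_stats", {}).get("total")
        if not cur_total or not prev_total:
            continue  # io_stats is not reported on every platform
        rates[node.get("name", node_id)] = {
            "read_ops_per_s": (cur_total["read_operations"]
                               - prev_total["read_operations"]) / interval_s,
            "write_ops_per_s": (cur_total["write_operations"]
                                - prev_total["write_operations"]) / interval_s,
        }
    return rates


# Hypothetical snapshots, 10 seconds apart, for a single node.
sample_before = {"nodes": {"n1": {"name": "node-1", "fs": {"io_stats": {
    "total": {"read_operations": 1000, "write_operations": 500}}}}}}
sample_after = {"nodes": {"n1": {"name": "node-1", "fs": {"io_stats": {
    "total": {"read_operations": 31000, "write_operations": 1500}}}}}}
sample_rates = io_rates(sample_before, sample_after, 10)
```

If the bad node's combined rate sits near the 3000 provisioned IOPS while the others are far below it, that would at least show whether the saturation is read-heavy or write-heavy.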

Sigh, the issue is happening again, so I'm including dumps:
tasks: https://pastebin.com/4CbXw7Cp
hot threads: https://pastebin.com/ywdG7CwJ
_cluster/pending_tasks: none

Oh, and the bad node is h9zeBGO.
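To compare the bad node against the others without reading the task dump by eye, the `GET _tasks?detailed=true` response can be summarized per node. This is a rough sketch, not a diagnosis: the `base_url` default and the helper names `fetch_tasks` and `summarize_tasks` are assumptions, and the sample data is invented.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_tasks(base_url="http://localhost:9200"):
    """Fetch the task list from the cluster (base_url is an assumption)."""
    with urlopen(base_url + "/_tasks?detailed=true") as resp:
        return json.load(resp)


def summarize_tasks(tasks_response):
    """Count tasks per node, grouped by action, and report the
    longest-running task on each node in milliseconds."""
    summary = {}
    for node_id, node in tasks_response.get("nodes", {}).items():
        tasks = list(node.get("tasks", {}).values())
        longest_ns = max(
            (t.get("running_time_in_nanos", 0) for t in tasks), default=0
        )
        summary[node_id] = {
            "name": node.get("name", node_id),
            "total": len(tasks),
            "by_action": Counter(t["action"] for t in tasks),
            "longest_ms": longest_ns / 1e6,
        }
    return summary


# Invented sample in the shape of a `_tasks` response.
sample_tasks = {"nodes": {
    "nodeA": {"name": "es-1", "tasks": {
        "nodeA:1": {"action": "indices:data/read/search",
                    "running_time_in_nanos": 5_000_000},
        "nodeA:2": {"action": "indices:data/read/search",
                    "running_time_in_nanos": 2_000_000_000},
        "nodeA:3": {"action": "indices:data/write/bulk",
                    "running_time_in_nanos": 40_000_000},
    }},
    "nodeB": {"name": "es-2", "tasks": {}},
}}
sample_summary = summarize_tasks(sample_tasks)
```

Calling `summarize_tasks(fetch_tasks())` while the problem is happening should show whether the extra tasks on the bad node are mostly searches, bulk writes, or something else, and how long the oldest one has been running.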
