I have run into an issue a few times now where a node is not receiving any requests but is using ~50% CPU indefinitely.
Looking at GET /_cat/thread_pool?v, I see one or more active flush threads. This node seems to have 4 stuck in flush:
node_name name active queue rejected
node-b flush 4 0 0
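For what it's worth, narrowing that call to the flush pool and adding the completed column (both standard _cat/thread_pool options) makes it easier to tell whether the threads ever finish anything:
GET /_cat/thread_pool/flush?v&h=node_name,name,active,queue,rejected,completed
If completed never increases while active sits at 4, the threads look genuinely wedged rather than just busy.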
The output of hot_threads corroborates this:
98.0% (489.9ms out of 500ms) cpu usage by thread 'elasticsearch[node-b][flush][T#2]'
97.8% (489.1ms out of 500ms) cpu usage by thread 'elasticsearch[node-b][flush][T#3]'
97.7% (488.5ms out of 500ms) cpu usage by thread 'elasticsearch[node-b][flush][T#4]'
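For reference, that comes from the nodes hot threads API; a call along these lines (the thread count is just an example value, and 500ms is the sampling interval that shows up in the output above) reproduces it:
GET /_nodes/node-b/hot_threads?type=cpu&threads=10&interval=500ms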
In these cases, running POST /_flush/synced?pretty fails on one or more shards with an error like this:
"my_index": {
"total": 6,
"successful": 4,
"failed": 2,
"failures": [
{
"shard": 1,
"reason": "[1] ongoing operations on primary"
}
]
}
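To see whether those flushes are making any progress at all, the per-shard flush and translog stats for the affected index can be pulled with something like this (my_index being the index from the failure above):
GET /my_index/_stats/flush,translog?level=shards
If the flush counters there never change between calls, that supports the idea that the threads are wedged rather than just slow.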
Restarting the node fixes the problem, but I'd like to understand how it gets into this state, and how I might keep this from happening in the first place.
There is certainly a lot of shard migration happening on this cluster, so perhaps that has something to do with it.
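When this happens, I can at least check whether relocations are still in flight with something like:
GET /_cat/recovery?v&active_only=true
GET /_cluster/health?filter_path=relocating_shards,initializing_shards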
Is there anything else I should look at to diagnose this, or anything I can do to rectify the situation other than bouncing ES?