Flush threads stuck [ES 6.1.2]

I have run into an issue a few times now where a node is not receiving any requests but sits at ~50% CPU indefinitely.
Looking at GET /_cat/thread_pool?v, I see one or more active flush threads. This node seems to have 4 stuck in flush:

node_name name                active queue rejected
node-b    flush                    4     0        0

The output of hot_threads corroborates this:

   98.0% (489.9ms out of 500ms) cpu usage by thread 'elasticsearch[node-b][flush][T#2]'
   97.8% (489.1ms out of 500ms) cpu usage by thread 'elasticsearch[node-b][flush][T#3]'
   97.7% (488.5ms out of 500ms) cpu usage by thread 'elasticsearch[node-b][flush][T#4]'
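
For anyone who wants to watch for this state automatically, here is a minimal sketch that scans the text output of GET /_cat/thread_pool?v for pools with active threads. The function name `busy_pools` is made up for illustration, and the parser assumes the header row is present (the `?v` flag) and that node names contain no whitespace. It only reports rows with `active > 0`; deciding that a thread is actually stuck still requires seeing the same count across several polls on an otherwise idle node.

```python
def busy_pools(cat_output, pool="flush"):
    """Return (node_name, active) pairs for rows of `pool` with active > 0.

    Assumes whitespace-separated columns with a header row, as printed by
    GET /_cat/thread_pool?v. `busy_pools` is a hypothetical helper name.
    """
    lines = [ln for ln in cat_output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    results = []
    for line in lines[1:]:
        # Pair each column value with its header name.
        row = dict(zip(header, line.split()))
        if row.get("name") == pool and int(row.get("active", 0)) > 0:
            results.append((row["node_name"], int(row["active"])))
    return results


# Sample copied from the output shown above.
sample = """node_name name                active queue rejected
node-b    flush                    4     0        0
"""
print(busy_pools(sample))  # [('node-b', 4)]
```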

In these cases, running POST /_flush/synced?pretty gives an error on one or more shards like this:

"my_index": {
    "total": 6,
    "successful": 4,
    "failed": 2,
    "failures": [
        "shard": 1,
        "reason": "[1] ongoing operations on primary"
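
To pull the failing shards out of a synced flush response programmatically, something like the sketch below could work. It assumes the response is a JSON object with a top-level `_shards` summary plus one entry per index, and that each entry's `failures` is a list of objects carrying `shard` and `reason`. The excerpt above is truncated, so that shape is an assumption, and the sample dict is hand-written for illustration and includes only the one failure visible in the excerpt. `failed_shards` is a hypothetical helper name.

```python
def failed_shards(response):
    """Return (index, shard, reason) tuples from a synced flush response.

    Assumes the parsed JSON body of POST /_flush/synced: a top-level
    "_shards" summary plus one object per index with a "failures" list.
    """
    failures = []
    for index, body in response.items():
        # Skip the cluster-wide summary and anything that is not an index entry.
        if index == "_shards" or not isinstance(body, dict):
            continue
        for failure in body.get("failures", []):
            failures.append((index, failure.get("shard"), failure.get("reason")))
    return failures


# Hand-written sample modelled on the excerpt above (assumed shape).
sample_response = {
    "_shards": {"total": 6, "successful": 4, "failed": 2},
    "my_index": {
        "total": 6,
        "successful": 4,
        "failed": 2,
        "failures": [
            {"shard": 1, "reason": "[1] ongoing operations on primary"},
        ],
    },
}
print(failed_shards(sample_response))
```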

Restarting the node fixes the problem, but I'd like to understand how it gets into this state, and how I might keep this from happening in the first place.

There is certainly a lot of shard migration happening on this cluster, so perhaps that has something to do with it.

Is there anything else I should look at to diagnose this, or anything else I can do to rectify the situation other than bouncing ES?


Not a hundred percent sure yet, but you might be hitting this one (fixed in 6.1.3): https://github.com/elastic/elasticsearch/pull/28350



@loren Could you please provide us with shard-level stats? They can be retrieved via GET /_stats?level=shards. You can post them here or send them to me at nhat.nguyen@elastic.co. Thank you!
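
If the full stats dump is too large to read by eye, a rough sketch like the following can reduce it to the flush counters per shard copy. It assumes the response nests as `indices` → index name → `shards` → shard number → list of shard copies, each with `routing` and `flush` sections; treat those field names as assumptions to verify against your own output. `flush_stats_by_shard` is a hypothetical helper name.

```python
def flush_stats_by_shard(stats):
    """Flatten a parsed GET /_stats?level=shards body into one row per shard copy.

    The field names used here ("routing", "flush", "total_time_in_millis")
    are assumed; missing fields simply come back as None.
    """
    rows = []
    for index, index_body in stats.get("indices", {}).items():
        for shard_id, copies in index_body.get("shards", {}).items():
            for copy in copies:
                routing = copy.get("routing", {})
                flush = copy.get("flush", {})
                rows.append({
                    "index": index,
                    "shard": int(shard_id),
                    "primary": routing.get("primary"),
                    "state": routing.get("state"),
                    "node": routing.get("node"),
                    "flush_total": flush.get("total"),
                    "flush_ms": flush.get("total_time_in_millis"),
                })
    return rows


# Hand-written sample with made-up values, for illustration only.
sample_stats = {
    "indices": {
        "my_index": {
            "shards": {
                "1": [
                    {
                        "routing": {"state": "STARTED", "primary": True, "node": "abc"},
                        "flush": {"total": 12, "total_time_in_millis": 3400},
                    }
                ]
            }
        }
    }
}
for row in flush_stats_by_shard(sample_stats):
    print(row)
```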

That looks awfully similar to the behavior we're seeing. Upgrading right now. I'll report back with shard-level stats if I see it again. Thank you for the heads up!
