Write queue continues to rise

One ES node's write queue has been rising continuously since about 08:00. But I set thread_pool.write.queue_size: 2000, so why is this node's write queue larger than 2000?
And why did the queue rise so suddenly? I checked, and the bulk requests have not changed much from usual.
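
The queue value I mean is the one reported for the write thread pool; as a rough illustration, it can be checked per node with a _cat request like the following (the column list is just an example):

GET /_cat/thread_pool/write?v&h=node_name,active,queue,rejected,queue_size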

Hi @ITzhangqiang,

first of all, which version of Elasticsearch are you running?

It looks like all indexing has halted and is queueing up instead; this might be a deadlock. Seeing a jstack output from the node, as well as the output of _nodes/stats, _nodes/hot_threads and _tasks, should help reveal the cause.
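
For reference, those can be gathered with console requests along these lines (a rough sketch; jstack itself has to be captured on the node's host against the Elasticsearch JVM process):

GET /_nodes/stats
GET /_nodes/hot_threads?threads=9999
GET /_tasks?detailed=true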

Linking your other thread here @ITzhangqiang

I think these two things are connected. As I said there, and as @HenningAndersen said here, we'll need the full jstack output to track down the deadlock.

Thanks for your reply.
Finally, I have uploaded the related documents to GitHub; URL:


@Armin_Braun
@HenningAndersen

@ITzhangqiang,

did the queuing of write requests resolve itself? I do not see any write requests queuing up in the _nodes/stats output. Which node had the issue?

@ITzhangqiang Can you please turn on and share the trace logs of IndexWriter?

PUT /_cluster/settings
{
  "transient": {
    "org.elasticsearch.index.engine.Engine.BD": "trace"
  }
}
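
Once applied, the new log level should be visible among the transient cluster settings, for example via:

GET /_cluster/settings?flat_settings=true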

data-02 had the "write queue rise" issue. At the time, I checked data-02's jstack output and jmap dump and didn't find anything unusual, so I had to restart data-02; after the restart, data-02's write queue went down.
I kept the data-02 jstack file from the time of the issue and have now uploaded it to the Git repository, so you can have a look.
Thanks

@ITzhangqiang Not sure if you missed my message above. It would be great if you could turn on and share the trace logs from the IndexWriter of data-02. Thank you!

Sorry, I only just saw your message, but data-02 has been restarted and its write queue is down.
I have turned on the trace and uploaded the logs.
path: test_Elasticsearch7.4/blob/master/logs-from-es-data-in-es-data-2.txt
@nhat

Another problem: data-06's refresh queue has been abnormal since yesterday, and now data-06 keeps being removed and re-added by the master over and over (it seems data-06 is hung and cannot respond to the master's cluster state publication in time). I don't know how to break out of this situation except by rebooting.
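
For reference, the removing and re-adding shows up when watching node membership, cluster health, and the master's pending cluster state tasks, e.g.:

GET /_cat/nodes?v
GET /_cluster/health
GET /_cluster/pending_tasks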

Thanks @ITzhangqiang. Sadly, the problematic node was restarted. Can you turn off the trace log and turn it on again when the problem reoccurs?

PUT /_cluster/settings
{
  "transient": {
    "org.elasticsearch.index.engine.Engine.BD": null
  }
}

@ITzhangqiang Can you also share ES log from data-06? Thank you!

@nhat Thank you for your reply. I have uploaded the data-06 log.
https://github.com/ITzhangqiang/test_Elasticsearch7.4/blob/master/logs-from-es-data-in-es-data-6%20(1).rar

No problem.

Since all of these problems appear to turn up on data-6, and the stack traces look like data-6 is simply taking forever on one thread that is RUNNABLE and doing work while another thread is blocked on the lock held by that working thread, I wonder whether this is really a system problem with data node 6 that is just showing up through ES here.
It's somewhat suspicious that first a write thread seemingly got stuck doing work and now the cluster state applier seems to have locked up.
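
A hot threads capture scoped to that node can help keep an eye on the working thread without a full jstack; something like the request below, where data-6 is assumed to be the node name to filter on (adjust to the actual name):

GET /_nodes/data-6/hot_threads?threads=9999&ignore_idle_threads=false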

@ITzhangqiang can you share some details on how data-06 is deployed? Is it running on some virtualization that throttles CPU or overcommits CPU across multiple VMs, or something along those lines?
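
If it is containerized, the cgroup statistics that Elasticsearch reports in its node stats can hint at CPU throttling (they are only present when the node runs under cgroups); for example:

GET /_nodes/stats/os?filter_path=nodes.*.os.cgroup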

@Armin_Braun
Thank you for your reply.
My ES 7.4 cluster is deployed on Kubernetes. Each ES instance is deployed on its own physical machine.
By the way, today I restarted some of the problematic nodes and the cluster became green. But after several hours the problems happened again; look:


I will upload the other logs to Git: https://github.com/ITzhangqiang/test_Elasticsearch7.4/tree/master/2020-01-05

Hi @nhat, the problem has reoccurred. You can look at the es-data-6 log ( "logger.org.elasticsearch.index.engine.Engine.BD": "trace" ) on

Thanks @ITzhangqiang. I am looking at the log.

Hi, is there any new development on this issue?
I have to recover my cluster, so I may need to restart some nodes. Do you need any other information from the problematic nodes before I restart them? I'd be happy to supply it.
Thanks!
@nhat

Hi @ITzhangqiang, not yet! Can you post the output of GET /_stats?level=shards, GET /_tasks, and the logs from the problematic nodes? Thank you.
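
For convenience, as console requests:

GET /_stats?level=shards
GET /_tasks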