Write queue continues to rise

ITzhangqiang · January 3, 2020, 5:28am

The write queue on one ES node has been rising continuously since about 08:00. But I set thread_pool.write.queue_size: 2000, so why is this node's write queue above 2000?
And why did the queue rise so suddenly? I checked, and the volume of bulk requests has not changed much from usual.

HenningAndersen · January 3, 2020, 11:51am

Hi @ITzhangqiang,

first of all, which version of Elasticsearch are you running?

It looks like all indexing halted and is queueing up instead, might be a deadlock. Seeing a jstack output from the node as well as the output of _nodes/stats, _nodes/hot_threads and _tasks should help reveal the cause.
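
For anyone gathering the same diagnostics, here is a minimal Python sketch that pulls the three REST outputs requested above (_nodes/stats, _nodes/hot_threads, _tasks) from one node. The base URL and the absence of authentication are assumptions, and the helper names are made up for illustration. jstack is a JDK command-line tool, so it has to be run separately against the Elasticsearch process.

```python
import urllib.request

# Endpoints named in the post above; hot_threads returns plain text, the others JSON.
DIAGNOSTIC_PATHS = ["_nodes/stats", "_nodes/hot_threads", "_tasks"]


def diagnostic_urls(base_url):
    """Build the full URL for each diagnostic endpoint."""
    base = base_url.rstrip("/")
    return [f"{base}/{path}" for path in DIAGNOSTIC_PATHS]


def collect(base_url, timeout=30):
    """Fetch every endpoint and return {path: response body as text}.

    Assumes an unsecured cluster; add credentials/TLS handling as needed.
    """
    results = {}
    for path, url in zip(DIAGNOSTIC_PATHS, diagnostic_urls(base_url)):
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            results[path] = resp.read().decode("utf-8")
    return results
```

Calling collect("http://localhost:9200") returns a dict keyed by endpoint path, which can be written to files and attached to the thread.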

Armin_Braun · January 3, 2020, 12:17pm

Linking your other thread here @ITzhangqiang

I think these two things are connected. As I said there, and as @HenningAndersen said here, we'll need the full jstack output to track down the deadlock.

ITzhangqiang · January 3, 2020, 3:28pm

Thanks for your reply.
I have finally uploaded the related documents to GitHub. URL:

@Armin_Braun
@HenningAndersen

HenningAndersen · January 3, 2020, 4:40pm

@ITzhangqiang,

did the queuing of write requests resolve itself? I do not see any write requests queuing up in the _nodes/stats output. Which node had the issue?
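
As a rough illustration of the check described above, this Python sketch reads a parsed _nodes/stats response and lists the write thread pool counters per node. The function name is hypothetical and the sample data is made up; only the nodes.<id>.name and nodes.<id>.thread_pool.write fields are taken from the stats layout.

```python
def write_queue_by_node(nodes_stats):
    """Return {node name: (active, queue, rejected)} for the write thread pool."""
    summary = {}
    for node_id, node in nodes_stats.get("nodes", {}).items():
        pool = node.get("thread_pool", {}).get("write")
        if pool is None:
            continue  # this node reported no write pool stats
        name = node.get("name", node_id)
        summary[name] = (
            pool.get("active", 0),
            pool.get("queue", 0),
            pool.get("rejected", 0),
        )
    return summary


# Hypothetical sample, trimmed to the fields used here; the numbers are made up.
sample = {
    "nodes": {
        "id-a": {"name": "es-data-1",
                 "thread_pool": {"write": {"active": 2, "queue": 0, "rejected": 0}}},
        "id-b": {"name": "es-data-2",
                 "thread_pool": {"write": {"active": 8, "queue": 3500, "rejected": 0}}},
    }
}

for name, (active, queue, rejected) in sorted(write_queue_by_node(sample).items()):
    print(f"{name}: active={active} queue={queue} rejected={rejected}")
```

A node whose queue value stays high across repeated calls is the one worth capturing a jstack from.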

nhat · January 3, 2020, 5:23pm

@ITzhangqiang Can you please turn on and share the trace logs of IndexWriter?

PUT /_cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.index.engine.Engine.BD": "trace"
  }
}

ITzhangqiang · January 4, 2020, 12:58am

data-02 had the "write queue rising" issue. At that time I checked the data-02 jstack file and jmap dump file and didn't find anything unusual, so I had to restart data-02. After the restart, the data-02 write queue went back down.
I kept the data-02 jstack file from the time of the issue and have now uploaded it to git; you can have a look.
Thanks

nhat · January 4, 2020, 2:35am

@ITzhangqiang Not sure if you missed my message above. It would be great if you could turn on and share the trace logs from IndexWriter on data-02. Thank you!

ITzhangqiang · January 4, 2020, 4:07am

Sorry, I just saw your message, but data-02 has been restarted and its write queue is back down.
I turned on the trace and uploaded the logs.
path: test_Elasticsearch7.4/blob/master/logs-from-es-data-in-es-data-2.txt
@nhat

ITzhangqiang · January 4, 2020, 4:54am

Another problem: the data-06 refresh queue has been abnormal since yesterday, and now data-06 is being removed and re-added again and again by the master (it seems data-06 is hung and can't respond to the master's cluster state publication in time). I don't know how to break out of this situation except by rebooting.

github.com

ITzhangqiang/test_Elasticsearch7.4/blob/master/data-06-remove-add-log

[2020-01-04T04:34:20,567][INFO ][o.e.c.s.MasterService    ] [es-master-1] node-left[{es-data-6}{uGjZU_TDRwuehBS4O6lo4Q}{EsFFGZ6ARNmkrqSYoBv-7Q}{xxxxxx..116}{xxxxxx..116:9300}{dil}{ml.machine_memory=134641295360, ml.max_open_jobs=20, xpack.installed=true} lagging], term: 26, version: 470148, reason: removed {{es-data-6}{uGjZU_TDRwuehBS4O6lo4Q}{EsFFGZ6ARNmkrqSYoBv-7Q}{xxxxxx..116}{xxxxxx..116:9300}{dil}{ml.machine_memory=134641295360, ml.max_open_jobs=20, xpack.installed=true},}
[2020-01-04T04:34:20,997][INFO ][o.e.c.s.ClusterApplierService] [es-master-1] removed {{es-data-6}{uGjZU_TDRwuehBS4O6lo4Q}{EsFFGZ6ARNmkrqSYoBv-7Q}{xxxxxx..116}{xxxxxx..116:9300}{dil}{ml.machine_memory=134641295360, ml.max_open_jobs=20, xpack.installed=true},}, term: 26, version: 470148, reason: Publication{term=26, version=470148}
[2020-01-04T04:34:22,731][INFO ][o.e.c.s.MasterService    ] [es-master-1] node-join[{es-data-6}{uGjZU_TDRwuehBS4O6lo4Q}{EsFFGZ6ARNmkrqSYoBv-7Q}{xxxxxx..116}{xxxxxx..116:9300}{dil}{ml.machine_memory=134641295360, ml.max_open_jobs=20, xpack.installed=true} join existing leader], term: 26, version: 470149, reason: added {{es-data-6}{uGjZU_TDRwuehBS4O6lo4Q}{EsFFGZ6ARNmkrqSYoBv-7Q}{xxxxxx..116}{xxxxxx..116:9300}{dil}{ml.machine_memory=134641295360, ml.max_open_jobs=20, xpack.installed=true},}
[2020-01-04T04:34:32,751][INFO ][o.e.c.c.C.CoordinatorPublication] [es-master-1] after [10s] publication of cluster state version [470149] is still waiting for {es-data-6}{uGjZU_TDRwuehBS4O6lo4Q}{EsFFGZ6ARNmkrqSYoBv-7Q}{xxxxxx..116}{xxxxxx..116:9300}{dil}{ml.machine_memory=134641295360, ml.max_open_jobs=20, xpack.installed=true} [SENT_APPLY_COMMIT]
[2020-01-04T04:34:52,751][INFO ][o.e.c.s.ClusterApplierService] [es-master-1] added {{es-data-6}{uGjZU_TDRwuehBS4O6lo4Q}{EsFFGZ6ARNmkrqSYoBv-7Q}{xxxxxx..116}{xxxxxx..116:9300}{dil}{ml.machine_memory=134641295360, ml.max_open_jobs=20, xpack.installed=true},}, term: 26, version: 470149, reason: Publication{term=26, version=470149}
[2020-01-04T04:34:52,757][WARN ][o.e.c.c.C.CoordinatorPublication] [es-master-1] after [30s] publication of cluster state version [470149] is still waiting for {es-data-6}{uGjZU_TDRwuehBS4O6lo4Q}{EsFFGZ6ARNmkrqSYoBv-7Q}{xxxxxx..116}{xxxxxx..116:9300}{dil}{ml.machine_memory=134641295360, ml.max_open_jobs=20, xpack.installed=true} [SENT_APPLY_COMMIT]
[2020-01-04T04:35:02,801][INFO ][o.e.c.c.C.CoordinatorPublication] [es-master-1] after [10s] publication of cluster state version [470150] is still waiting for {es-data-6}{uGjZU_TDRwuehBS4O6lo4Q}{EsFFGZ6ARNmkrqSYoBv-7Q}{xxxxxx..116}{xxxxxx..116:9300}{dil}{ml.machine_memory=134641295360, ml.max_open_jobs=20, xpack.installed=true} [SENT_APPLY_COMMIT]
[2020-01-04T04:35:22,804][WARN ][o.e.c.c.C.CoordinatorPublication] [es-master-1] after [30s] publication of cluster state version [470150] is still waiting for {es-data-6}{uGjZU_TDRwuehBS4O6lo4Q}{EsFFGZ6ARNmkrqSYoBv-7Q}{xxxxxx..116}{xxxxxx..116:9300}{dil}{ml.machine_memory=134641295360, ml.max_open_jobs=20, xpack.installed=true} [SENT_APPLY_COMMIT]
[2020-01-04T04:35:32,836][INFO ][o.e.c.c.C.CoordinatorPublication] [es-master-1] after [10s] publication of cluster state version [470151] is still waiting for {es-data-6}{uGjZU_TDRwuehBS4O6lo4Q}{EsFFGZ6ARNmkrqSYoBv-7Q}{xxxxxx..116}{xxxxxx..116:9300}{dil}{ml.machine_memory=134641295360, ml.max_open_jobs=20, xpack.installed=true} [SENT_APPLY_COMMIT]
[2020-01-04T04:35:52,844][WARN ][o.e.c.c.C.CoordinatorPublication] [es-master-1] after [30s] publication of cluster state version [470151] is still waiting for {es-data-6}{uGjZU_TDRwuehBS4O6lo4Q}{EsFFGZ6ARNmkrqSYoBv-7Q}{xxxxxx..116}{xxxxxx..116:9300}{dil}{ml.machine_memory=134641295360, ml.max_open_jobs=20, xpack.installed=true} [SENT_APPLY_COMMIT]
[2020-01-04T04:36:22,758][WARN ][o.e.c.c.LagDetector      ] [es-master-1] node [{es-data-6}{uGjZU_TDRwuehBS4O6lo4Q}{EsFFGZ6ARNmkrqSYoBv-7Q}{xxxxxx..116}{xxxxxx..116:9300}{dil}{ml.machine_memory=134641295360, ml.max_open_jobs=20, xpack.installed=true}] is lagging at cluster state version [0], although publication of cluster state version [470149] completed [1.5m] ago
[2020-01-04T04:36:22,801][INFO ][o.e.c.s.MasterService    ] [es-master-1] node-left[{es-data-6}{uGjZU_TDRwuehBS4O6lo4Q}{EsFFGZ6ARNmkrqSYoBv-7Q}{xxxxxx..116}{xxxxxx..116:9300}{dil}{ml.machine_memory=134641295360, ml.max_open_jobs=20, xpack.installed=true} lagging], term: 26, version: 470152, reason: removed {{es-data-6}{uGjZU_TDRwuehBS4O6lo4Q}{EsFFGZ6ARNmkrqSYoBv-7Q}{xxxxxx..116}{xxxxxx..116:9300}{dil}{ml.machine_memory=134641295360, ml.max_open_jobs=20, xpack.installed=true},}
[2020-01-04T04:36:23,586][INFO ][o.e.c.s.ClusterApplierService] [es-master-1] removed {{es-data-6}{uGjZU_TDRwuehBS4O6lo4Q}{EsFFGZ6ARNmkrqSYoBv-7Q}{xxxxxx..116}{xxxxxx..116:9300}{dil}{ml.machine_memory=134641295360, ml.max_open_jobs=20, xpack.installed=true},}, term: 26, version: 470152, reason: Publication{term=26, version=470152}
[2020-01-04T04:36:24,968][INFO ][o.e.c.s.MasterService    ] [es-master-1

This file has been truncated.
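
To see how often a node is being dropped and re-added, a master log like the excerpt above can be scanned for MasterService node-left / node-join lines. The following is a minimal Python sketch under that assumption; the regular expression is fitted to the line layout shown in this excerpt only, the function names are hypothetical, and the sample lines are shortened copies of lines from the excerpt.

```python
import re
from collections import Counter

# Matches lines such as:
# [ts][INFO ][o.e.c.s.MasterService    ] [es-master-1] node-left[{es-data-6}{...
EVENT = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[\w+\s*\]\[o\.e\.c\.s\.MasterService\s*\] "
    r"\[[^\]]+\] (?P<event>node-left|node-join)\[\{(?P<node>[^}]+)\}"
)


def membership_events(lines):
    """Yield (timestamp, event, node name) for each MasterService join/leave line."""
    for line in lines:
        match = EVENT.match(line)
        if match:
            yield match["ts"], match["event"], match["node"]


def count_events(lines):
    """Count events per (node name, event) pair."""
    return Counter((node, event) for _, event, node in membership_events(lines))


# Shortened copies of three lines from the excerpt above.
sample = [
    "[2020-01-04T04:34:20,567][INFO ][o.e.c.s.MasterService    ] [es-master-1] "
    "node-left[{es-data-6}{uGjZU_TDRwuehBS4O6lo4Q} lagging], term: 26, version: 470148",
    "[2020-01-04T04:34:22,731][INFO ][o.e.c.s.MasterService    ] [es-master-1] "
    "node-join[{es-data-6}{uGjZU_TDRwuehBS4O6lo4Q} join existing leader], term: 26, version: 470149",
    "[2020-01-04T04:36:22,801][INFO ][o.e.c.s.MasterService    ] [es-master-1] "
    "node-left[{es-data-6}{uGjZU_TDRwuehBS4O6lo4Q} lagging], term: 26, version: 470152",
]

for (node, event), n in sorted(count_events(sample).items()):
    print(node, event, n)
```

Feeding the full master log through count_events gives a quick per-node tally of how many remove/add cycles have happened.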

nhat · January 4, 2020, 5:26am

Thanks @ITzhangqiang. Sadly, the problematic node was restarted. Can you turn off the trace log and turn it on again when the problem reoccurs?

PUT /_cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.index.engine.Engine.BD": null
  }
}
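
If the trace logger has to be toggled repeatedly while waiting for the problem to reoccur, the two request bodies can be generated from one helper. This is only a sketch: the helper name is hypothetical, and the logger key is written with the "logger." prefix as it is quoted later in this thread.

```python
import json

# Logger name as quoted later in this thread. The "logger." prefix is what
# marks the key as a log-level setting in the cluster settings API.
LOGGER_KEY = "logger.org.elasticsearch.index.engine.Engine.BD"


def logging_settings_body(enabled):
    """Body for PUT /_cluster/settings: "trace" turns the logger on, null resets it."""
    return {"transient": {LOGGER_KEY: "trace" if enabled else None}}


print(json.dumps(logging_settings_body(True), indent=2))
print(json.dumps(logging_settings_body(False), indent=2))
```

Python's None serializes to JSON null, which is what resets a transient setting to its default.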

nhat · January 4, 2020, 5:28am

@ITzhangqiang Can you also share the ES log from data-06? Thank you!

ITzhangqiang · January 4, 2020, 5:47am

@nhat Thank you for your reply. I have uploaded the data-06 log.
https://github.com/ITzhangqiang/test_Elasticsearch7.4/blob/master/logs-from-es-data-in-es-data-6%20(1).rar

ITzhangqiang · January 4, 2020, 5:49am

no problem.

Armin_Braun · January 5, 2020, 2:29pm

All these problems appear to turn up on data-6, and the stack traces look like data-6 is simply taking forever on one thread that is Runnable and doing work, while some other thread is blocked on the lock held by the working thread. So I wonder: could this rather be a system problem with data node 6 that merely shows up via ES here?
It's somewhat suspicious that first a write thread seemingly got stuck doing work and now the cluster state (CS) applier seems to have locked up.

@ITzhangqiang can you share some details on how data-06 is deployed? Is it running on some virtualization that is doing something along the lines of throttling CPU or overcommitting CPU across multiple VMs or so?
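
One way to check for CPU throttling in a containerized deployment is to look at the cgroup's cpu.stat counters. Below is a minimal Python sketch, assuming the cgroup v1 layout (nr_periods, nr_throttled, throttled_time) and a file such as /sys/fs/cgroup/cpu/cpu.stat inside the container; the exact path depends on the host setup, the function names are hypothetical, and the sample numbers are made up.

```python
def parse_cpu_stat(text):
    """Parse cgroup cpu.stat content ("key value" per line) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_fraction(stats):
    """Fraction of CFS enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


# Hypothetical cpu.stat content in the cgroup v1 layout; the numbers are made up.
sample = "nr_periods 1000\nnr_throttled 250\nthrottled_time 42000000000\n"
stats = parse_cpu_stat(sample)
print(f"throttled in {throttled_fraction(stats):.1%} of periods")
```

A fraction that keeps growing between two readings taken a few minutes apart would point to the CPU limit rather than to Elasticsearch itself.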

ITzhangqiang · January 5, 2020, 3:02pm

@Armin_Braun
Thank you for your reply.
My ES 7.4 cluster is deployed on Kubernetes, with one ES instance per physical machine.
By the way, today I restarted some of the affected nodes and the cluster became green. But after several hours the problems happened again; see:

I will upload the other logs to git: https://github.com/ITzhangqiang/test_Elasticsearch7.4/tree/master/2020-01-05

ITzhangqiang · January 6, 2020, 2:10am

@nhat Hi, the problem has reoccurred. You can look at the es-data-6 log (with "logger.org.elasticsearch.index.engine.Engine.BD": "trace") at

github.com

ITzhangqiang/test_Elasticsearch7.4/blob/master/2020-01-05/es-data-06-monitor-picture/logs-from-es-data-in-es-data-6 (4).txt

[2020-01-05T08:58:52,862][WARN ][o.e.c.l.LogConfigurator  ] [es-data-6] Some logging configurations have %marker but don't have %node_name. We will automatically add %node_name to the pattern to ease the migration for users who customize log4j2.properties but will stop this behavior in 7.0. You should manually replace `%node_name` with `[%node_name]%marker ` in these locations:
  /usr/share/elasticsearch/config/log4j2.properties
[2020-01-05T08:58:54,146][INFO ][o.e.e.NodeEnvironment    ] [es-data-6] using [12] data paths, mounts [[/usr/share/elasticsearch/data/data7 (/dev/sdj), /usr/share/elasticsearch/data/data2 (/dev/sde), /usr/share/elasticsearch/data/data3 (/dev/sdf), /usr/share/elasticsearch/data/data6 (/dev/sdi), /usr/share/elasticsearch/data/data8 (/dev/sdk), /usr/share/elasticsearch/data/data10 (/dev/sdm), /usr/share/elasticsearch/data/data11 (/dev/sdn), /usr/share/elasticsearch/data/data9 (/dev/sdl), /usr/share/elasticsearch/data/data0 (/dev/sdc), /usr/share/elasticsearch/data/data1 (/dev/sdd), /usr/share/elasticsearch/data/data4 (/dev/sdg), /usr/share/elasticsearch/data/data5 (/dev/sdh)]], net usable_space [80tb], net total_space [86.6tb], types [ext4]
[2020-01-05T08:58:54,146][INFO ][o.e.e.NodeEnvironment    ] [es-data-6] heap size [31gb], compressed ordinary object pointers [true]
[2020-01-05T08:59:00,868][INFO ][o.e.n.Node               ] [es-data-6] node name [es-data-6], node ID [uGjZU_TDRwuehBS4O6lo4Q], cluster name [txtg-es-cluster]
[2020-01-05T08:59:00,869][INFO ][o.e.n.Node               ] [es-data-6] version[7.4.0], pid[1], build[default/docker/22e1767283e61a198cb4db791ea66e3f11ab9910/2019-09-27T08:36:48.569419Z], OS[Linux/3.10.0-862.el7.x86_64/amd64], JVM[AdoptOpenJDK/OpenJDK 64-Bit Server VM/13/13+33]
[2020-01-05T08:59:00,869][INFO ][o.e.n.Node               ] [es-data-6] JVM home [/usr/share/elasticsearch/jdk]
[2020-01-05T08:59:00,869][INFO ][o.e.n.Node               ] [es-data-6] JVM arguments [-Xms31g, -Xmx31g, -XX:+UseG1GC, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=data, -XX:ErrorFile=logs/hs_err_pid%p.log, -javaagent:/usr/share/elasticsearch/jmx/jmx_prometheus_javaagent-0.12.0.jar=9201:/usr/share/elasticsearch/jmx/jmx.yaml, -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m, -Djava.locale.providers=COMPAT, -Des.cgroups.hierarchy.override=/, -Dio.netty.allocator.type=pooled, -XX:MaxDirectMemorySize=16642998272, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/usr/share/elasticsearch/config, -Des.distribution.flavor=default, -Des.distribution.type=docker, -Des.bundled_jdk=true]
[2020-01-05T08:59:03,773][INFO ][o.e.p.PluginsService     ] [es-data-6] loaded module [aggs-matrix-stats]
[2020-01-05T08:59:03,774][INFO ][o.e.p.PluginsService     ] [es-data-6] loaded module [analysis-common]
[2020-01-05T08:59:03,774][INFO ][o.e.p.PluginsService     ] [es-data-6] loaded module [data-frame]
[2020-01-05T08:59:03,774][INFO ][o.e.p.PluginsService     ] [es-data-6] loaded module [flattened]
[2020-01-05T08:59:03,774][INFO ][o.e.p.PluginsService     ] [es-data-6] loaded module [frozen-indices]
[2020-01-05T08:59:03,774][INFO ][o.e.p.PluginsService     ] [es-data-6] loaded module [ingest-common]
[2020-01-05T08:59:03,774][INFO ][o.e.p.PluginsService     ] [es-data-6] loaded module [ingest-geoip]
[2020-01-05T08:59:03,774][INFO ][o.e.p.PluginsService     ] [es-data-6] loaded module [ingest-user-agent]
[2020-01-05T08:59:03,774][INFO ][o.e.p.PluginsService     ] [es-data-6] loaded module [lang-expression]
[2020-01-05T08:59:03,774][INFO ][o.e.p.PluginsService     ] [es-data-6] loaded module [lang-mustache]
[2020-01-05T08:59:03,774][INFO ][o.e.p.PluginsService     ] [es-data-6] loaded module [lang-painless]
[2020-01-05T08:59:03,774][INFO ][o.e.p.PluginsService     ] [es-data-6] loaded module [mapper-extras]

This file has been truncated.

nhat · January 6, 2020, 2:50am

Thanks @ITzhangqiang. I am looking at the log.

ITzhangqiang · January 6, 2020, 9:56am

Hi, is there any new development on this issue?
I have to recover my cluster, so I may need to restart some nodes. Do you need any other information about the affected nodes before I restart them? I'd be happy to supply it.
Thanks!
@nhat

nhat · January 6, 2020, 1:43pm

Hi @ITzhangqiang, not yet! Can you post the output of GET /_stats?level=shards and GET /_tasks, plus the logs from the problematic nodes? Thank you.
