Index writer memory continues to rise

ITzhangqiang · January 2, 2020, 6:36am

Hi all,
I have a question that has puzzled me for a long time.
One node in the cluster has too much index writer memory (the other nodes are fine), and it keeps going up, which leads to index throttling.

I then adjusted index.refresh_interval from 30s to 10s, but the situation has not improved much.
I also found that the affected node's refresh queue has a large backlog.

Why does the index writer use so much memory? I found nothing in the logs.
As index writer memory continues to rise, it eventually leads to this exception: "org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [33137445694/30.8gb], which is larger than the limit of [31621696716/29.4gb]"

Elasticsearch version: 7.4.0
JVM heap: 31G

Could you help me analyze this problem? @DavidTurner

Armin_Braun · January 2, 2020, 7:31am

Hi @ITzhangqiang

Are you triggering refreshes manually at a high rate somehow? (it certainly looks like it).
Maybe those refreshes only trigger shards residing on es-data-8 causing it to become overloaded and unable to flush its index writers.

Can you check the tasks API (https://www.elastic.co/guide/en/elasticsearch/reference/current/tasks.html#tasks-api-examples) for es-data-8 and share the results? That should help figure out where those refreshes are coming from, I think.
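
A minimal sketch of what to look for in that output, assuming the tasks API response has been saved as JSON and uses the usual `nodes` -> `tasks` layout with `action` and `running_time_in_nanos` fields. The helper name `long_running_refreshes`, the one-hour threshold, and the sample data are illustrative only and do not come from this thread.

```python
def long_running_refreshes(tasks_response, min_seconds=3600):
    """Return (node_name, task_id, action, seconds) for refresh tasks
    running at least min_seconds, longest-running first."""
    found = []
    for node_id, node in tasks_response.get("nodes", {}).items():
        node_name = node.get("name", node_id)
        for task_id, task in node.get("tasks", {}).items():
            action = task.get("action", "")
            # Refresh actions start with indices:admin/refresh; shard-level
            # work carries suffixes such as [s] or [s][p].
            if not action.startswith("indices:admin/refresh"):
                continue
            seconds = task.get("running_time_in_nanos", 0) / 1e9
            if seconds >= min_seconds:
                found.append((node_name, task_id, action, seconds))
    return sorted(found, key=lambda item: item[3], reverse=True)


# Made-up response fragment, shaped like a tasks API reply.
sample_tasks = {
    "nodes": {
        "n1": {
            "name": "es-data-8",
            "tasks": {
                "n1:10": {"action": "indices:admin/refresh[s][p]",
                          "running_time_in_nanos": 16 * 3600 * 10**9},
                "n1:11": {"action": "indices:data/write/bulk",
                          "running_time_in_nanos": 5 * 10**9},
                "n1:12": {"action": "indices:admin/refresh",
                          "running_time_in_nanos": 2 * 10**9},
            },
        }
    }
}

stuck = long_running_refreshes(sample_tasks)
for node_name, task_id, action, seconds in stuck:
    print(f"{node_name} {task_id} {action} running for {seconds / 3600:.1f}h")
```

A refresh normally finishes quickly, so any refresh task that survives the threshold is worth a closer look.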

ITzhangqiang · January 2, 2020, 8:07am

Thank you for your reply.
I don't refresh the index manually. I use Logstash to transport data from Kafka to Elasticsearch.
I use Logstash to write the same data to both an ES 6.3 cluster and an ES 7.4 cluster; the ES 6.3 cluster has never had this problem.
Because the es-data-8 node's memory reached 95% of the heap size (leading to the CircuitBreakingException), the tasks API can't return a correct result. (Exception: [circuit_breaking_exception] [parent] Data too large, data for [<http_request>] would be [32273530784/30gb], which is larger than the limit of [31621696716/29.4gb])

I tried the tasks API many times; it returned the result below (I think it is incomplete; previously it returned many refresh tasks).

One more piece of information: I checked the shards on data8. Most index shards are evenly distributed across all nodes, so index refreshes happen on all nodes.

Armin_Braun · January 2, 2020, 11:22am

Thanks for the tasks list @ITzhangqiang

It looks like the refresh is stuck/dead-locked somehow (it's been running for 16h+ already!). That seems to be the problem. Can you take a thread dump on data8, so we can start tracking down why/where it dead-locked?

Thanks!

ITzhangqiang · January 2, 2020, 11:46am

Unfortunately, I deployed the cluster on Kubernetes, in Docker containers, so I can't run jstack.

Do you know how to set it up?

Armin_Braun · January 2, 2020, 11:53am

@ITzhangqiang

it should work fine using nsenter into the Docker container. The easiest way of doing that, that I know of is https://github.com/jpetazzo/nsenter (unless you have nsenter properly working already and know how to do it :)) That should allow you to jstack just fine without permission issues.

ITzhangqiang · January 2, 2020, 12:12pm

Thank you very much. I have already restarted es-data-8, and the cluster's health has recovered to green, so a thread dump can no longer track down the issue. But I think this problem will happen again. If it does, I will let you know.
Thank you once again!

ITzhangqiang · January 3, 2020, 8:54am

Hi, another node has hit this problem.

I checked the jstack file and found: locked <0x000000128539b510>

@Armin_Braun

Armin_Braun · January 3, 2020, 9:41am

@ITzhangqiang can you provide the full jstack file here so I can take a look?

ITzhangqiang · January 3, 2020, 9:59am

Thank you for your reply.

I ran jstack twice, and both runs show the same blocked thread.

I found that many nodes have this problem.

(Attachment jstack06_2 is missing)

(Attachment jstack06 is missing)

ITzhangqiang · January 3, 2020, 10:02am

I received this message; I can't upload the file.

Armin_Braun · January 3, 2020, 10:08am

@ITzhangqiang

can you upload the jstack files to e.g. https://pastebin.com maybe and link them here? That should work fine

nhat · January 3, 2020, 5:41pm

I believe those refreshes are queued up by IndexingMemoryController. I will take a closer look.

ITzhangqiang · January 9, 2020, 6:36am

hi,
@nhat
@Armin_Braun
I'm sorry to bother you. The problem remains: in Elasticsearch 7.4 there are always some nodes whose index writer holds too much memory. I believe this is a big issue (ES 6.3 never showed this issue in a year). Please help me find the reason behind it! We must solve this problem!

nhat · January 9, 2020, 1:57pm

@ITzhangqiang No worries, we are here to help. Can you share the shard-level stats of your cluster (GET /_stats?level=shards)?
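
A rough sketch of how that response could be scanned for outliers, assuming it has been saved as JSON and that each shard copy carries `routing.node`, `segments.index_writer_memory_in_bytes`, and `translog.uncommitted_size_in_bytes` as in the 7.x stats format. The function name and the sample numbers are made up for illustration.

```python
def shards_by_writer_memory(stats, top=5):
    """Rank shard copies by index writer memory, largest first."""
    rows = []
    for index_name, index in stats.get("indices", {}).items():
        # With level=shards, each shard ID maps to a list of shard copies.
        for shard_id, copies in index.get("shards", {}).items():
            for copy in copies:
                routing = copy.get("routing", {})
                rows.append({
                    "index": index_name,
                    "shard": shard_id,
                    "node_id": routing.get("node"),
                    "primary": routing.get("primary"),
                    "writer_bytes": copy.get("segments", {}).get(
                        "index_writer_memory_in_bytes", 0),
                    "uncommitted_bytes": copy.get("translog", {}).get(
                        "uncommitted_size_in_bytes", 0),
                })
    rows.sort(key=lambda row: row["writer_bytes"], reverse=True)
    return rows[:top]


# Made-up stats fragment with one abnormal shard copy.
sample_stats = {
    "indices": {
        "logs-a": {"shards": {
            "0": [{"routing": {"node": "nodeA", "primary": True},
                   "segments": {"index_writer_memory_in_bytes": 4_000_000_000},
                   "translog": {"uncommitted_size_in_bytes": 9_000_000_000}}],
            "1": [{"routing": {"node": "nodeB", "primary": True},
                   "segments": {"index_writer_memory_in_bytes": 20_000_000},
                   "translog": {"uncommitted_size_in_bytes": 50_000_000}}],
        }},
        "logs-b": {"shards": {
            "0": [{"routing": {"node": "nodeA", "primary": False},
                   "segments": {"index_writer_memory_in_bytes": 35_000_000},
                   "translog": {"uncommitted_size_in_bytes": 10_000_000}}],
        }},
    }
}

ranked = shards_by_writer_memory(sample_stats, top=2)
for row in ranked:
    print(row["index"], row["shard"], row["node_id"], row["writer_bytes"])
```

Note that `routing.node` is a node ID rather than a node name, so it has to be matched against the node list separately.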

nhat · January 10, 2020, 2:35am

@ITzhangqiang I have merged https://github.com/elastic/elasticsearch/pull/50769, which should avoid flooding the refresh thread pool.

ITzhangqiang · January 10, 2020, 5:30am

Thank you very much,
@nhat
I found that the index writer's abnormal behavior may be related to one particular index.
See the index below:

This index causes the abnormal index writer memory on the data-09 node; see the data-09 node below:

I checked the shards of this index and found one shard holding a large amount of index writer memory:

I uploaded the data-09 jstack file and the detailed state of this abnormal index's shard.

nhat · January 11, 2020, 4:49pm

Hi @ITzhangqiang,

"uncommitted_size_in_bytes" : 12605908804

The uncommitted translog should not go above 512MB per shard by default. Did you change any translog setting? Can you share the logs from the node data-09?

One theory that I have is that the throttling does not work well. Can you add -Des.index.memory.max_index_buffer_size=256mb to config/jvm.options on some nodes and then restart them? Please let me know if the problem goes away on those nodes. Thank you.
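
To put the quoted number in perspective, here is a small sketch that compares an uncommitted translog size with a flush threshold. It assumes Elasticsearch-style size strings with 1024-based units; `parse_size` and `translog_overshoot` are hypothetical helpers, not Elasticsearch APIs.

```python
# 1024-based units, as used by Elasticsearch byte-size settings.
UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}


def parse_size(text):
    """Convert a size string such as '512mb' into a number of bytes."""
    text = text.strip().lower()
    # Check two-letter suffixes before the bare 'b' suffix.
    for suffix in sorted(UNITS, key=len, reverse=True):
        if text.endswith(suffix):
            return int(float(text[:-len(suffix)]) * UNITS[suffix])
    return int(text)


def translog_overshoot(uncommitted_bytes, threshold="512mb"):
    """Return how many times larger the uncommitted translog is than the threshold."""
    return uncommitted_bytes / parse_size(threshold)


uncommitted = 12605908804  # value quoted above
print(f"{uncommitted / 1024**3:.2f} GiB uncommitted")
print(f"{translog_overshoot(uncommitted):.1f}x the 512mb default")
```

The quoted value is roughly 11.7 GiB, more than 20 times the 512mb default, which is why it stands out.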

ITzhangqiang · January 13, 2020, 2:43am

Hi @nhat
For most indices, I changed the translog settings:
"translog" : {
  "sync_interval" : "60s",
  "durability" : "async",
  "flush_threshold_size" : "1gb"
}
I think I have found the problem. I checked the abnormal index carefully and found that I manually specify the "document_id" field, but the index data is frequently problematic, so millions of documents share the same "document_id" and keep being updated.
(As I see in the jstack file, most write threads are waiting on the "doc_id lock":
at org.elasticsearch.index.engine.LiveVersionMap.acquireLock(LiveVersionMap.java:473)
at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:856))
This problematic index is what led to these issues: "index writer memory rise" and "write queue rise".
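
A minimal sketch of how such a thread dump could be summarized, assuming plain jstack output in which each thread block starts with a quoted thread name and blocks are separated by blank lines. The function name and the sample dump are made up for illustration; only the two stack frames are taken from the dump described above.

```python
def threads_with_frame(jstack_text, frame):
    """Return the names of threads whose stack contains the given frame text."""
    names = []
    for block in jstack_text.split("\n\n"):
        lines = block.strip().splitlines()
        # A thread block begins with a header such as "name" #12 daemon ...
        if not lines or not lines[0].startswith('"'):
            continue
        if any(frame in line for line in lines[1:]):
            names.append(lines[0].split('"')[1])
    return names


# Made-up dump fragment in jstack's layout.
SAMPLE_DUMP = '''\
"elasticsearch[node-a][write][T#1]" #101 daemon prio=5
   java.lang.Thread.State: WAITING (parking)
\tat org.elasticsearch.index.engine.LiveVersionMap.acquireLock(LiveVersionMap.java:473)
\tat org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:856)

"elasticsearch[node-a][write][T#2]" #102 daemon prio=5
   java.lang.Thread.State: WAITING (parking)
\tat org.elasticsearch.index.engine.LiveVersionMap.acquireLock(LiveVersionMap.java:473)
\tat org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:856)

"elasticsearch[node-a][refresh][T#1]" #103 daemon prio=5
   java.lang.Thread.State: TIMED_WAITING (sleeping)
\tat java.lang.Thread.sleep(Native Method)
'''

waiting = threads_with_frame(SAMPLE_DUMP, "LiveVersionMap.acquireLock")
print(len(waiting), "threads waiting on the per-document lock")
```

If most write threads show up in this list, many concurrent operations are probably targeting the same document IDs.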

After I changed the problematic index, everything went back to normal.
Thank you very much for your help during this period.
Thanks again!
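
For anyone hitting the same symptom, one way to check a batch for repeated IDs before it is indexed is sketched below. It assumes events are dictionaries carrying a "document_id" field as described above; the function name and sample data are hypothetical, and this is not necessarily how the data was fixed here.

```python
from collections import Counter


def duplicate_ids(events, id_field="document_id", min_count=2):
    """Return (id, count) pairs for IDs seen at least min_count times, most frequent first."""
    counts = Counter(event[id_field] for event in events if id_field in event)
    return [(doc_id, n) for doc_id, n in counts.most_common() if n >= min_count]


# Made-up batch: "a" repeats three times, "c" twice, "b" once.
batch = [
    {"document_id": "a", "msg": "1"},
    {"document_id": "a", "msg": "2"},
    {"document_id": "b", "msg": "3"},
    {"document_id": "c", "msg": "4"},
    {"document_id": "a", "msg": "5"},
    {"document_id": "c", "msg": "6"},
    {"msg": "no id, left for Elasticsearch to generate"},
]

repeats = duplicate_ids(batch)
for doc_id, n in repeats:
    print(f"document_id {doc_id!r} appears {n} times")
```

A handful of repeats is normal for updates; millions of writes to the same few IDs is the pattern that caused the contention in this thread.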

nhat · January 13, 2020, 1:43pm

@ITzhangqiang Glad to hear that the problem was solved. You're welcome and thank you for collaborating.
