Index writer memory continues to rise

Hi all,
I have a question that has puzzled me for a long time.
One node in the cluster uses too much index writer memory (the other nodes are fine), and it keeps rising, which leads to index throttling.


I then adjusted index.refresh_interval from 30s to 10s, but the situation has not improved much.
I also found that the affected node's refresh queue has a large backlog (see screenshot).
Why does the index writer use so much memory? I found nothing in the logs.
As the index writer memory continues to rise, it eventually leads to the exception: "org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [33137445694/30.8gb], which is larger than the limit of [31621696716/29.4gb]"

Elasticsearch version: 7.4.0
JVM heap: 31 GB
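
(For reference, the per-node index writer memory can be checked with a node-stats query along these lines; localhost:9200 is a placeholder for one of the cluster's HTTP endpoints, and the filter_path is just one way to trim the output:)

```sh
# Per-node index writer memory, taken from the segments section of the
# nodes stats API; localhost:9200 is a placeholder.
curl -s 'http://localhost:9200/_nodes/stats/indices/segments?filter_path=nodes.*.name,nodes.*.indices.segments.index_writer_memory_in_bytes&pretty'
```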

Could you help me analyze this problem? @DavidTurner


Hi @ITzhangqiang

Are you triggering refreshes manually at a high rate somehow? (It certainly looks like it.)
Maybe those refreshes only hit shards residing on es-data-8, causing it to become overloaded and unable to flush its index writers.

Can you check the [tasks API](https://www.elastic.co/guide/en/elasticsearch/reference/current/tasks.html#tasks-api-examples) for es-data-8 and share the results? That should help figure out where those refreshes are coming from, I think.
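
Something along these lines should list the refresh tasks currently running on that node (localhost:9200 is a placeholder for any HTTP endpoint of the cluster):

```sh
# Detailed view of running refresh tasks on the es-data-8 node, including how
# long each task has been running and which parent task spawned it.
curl -s 'http://localhost:9200/_tasks?nodes=es-data-8&actions=*refresh*&detailed=true&pretty'
```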

Thank you for your reply.
I don't refresh the indices manually; I use Logstash to ship data from Kafka to Elasticsearch.
I use Logstash to double-write the data to an ES 6.3 cluster and an ES 7.4 cluster, and the ES 6.3 cluster has never had this problem.
Because the es-data-8 node's memory reached 95% of the heap size (leading to the CircuitBreakingException), the tasks API can't return a correct result. (Exception: [circuit_breaking_exception] [parent] Data too large, data for [<http_request>] would be [32273530784/30gb], which is larger than the limit of [31621696716/29.4gb])

I tried the tasks API many times and it returned the result below (I think it is incomplete; previously it returned many refresh tasks).

Another piece of information: I checked the shards on data8; most index shards are evenly distributed across all nodes, so index refreshes happen on all nodes.

Thanks for the tasks list, @ITzhangqiang.

It looks like the refresh is stuck/deadlocked somehow (it has been running for 16h+ already!). That seems to be the problem. Can you take a thread dump on data8, so we can start tracking down why/where it deadlocked?

Thanks!

Unfortunately, I deployed the cluster on Kubernetes, in Docker containers, so I can't run jstack. :sob:


Do you know how to set it up?

@ITzhangqiang

It should work fine using nsenter to enter the Docker container's namespaces. The easiest way of doing that that I know of is https://github.com/jpetazzo/nsenter (unless you already have nsenter working properly and know how to do it :)). That should allow you to run jstack just fine without permission issues.
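
Roughly, the idea looks like the sketch below, run on the Kubernetes node that hosts the container; the container name, JDK path, and PIDs are assumptions and will differ in your setup:

```sh
# Host PID of the container's main process (container name is a placeholder).
ES_PID=$(docker inspect --format '{{.State.Pid}}' es-data-8-container)

# Enter the container's namespaces and take a thread dump with the JDK bundled
# in the official Elasticsearch image. "jstack 1" assumes Elasticsearch is the
# main process (PID 1) inside the container; if not, find the java PID with jps
# first. jstack may also need to run as the same user as the Elasticsearch JVM.
nsenter --target "$ES_PID" --mount --uts --ipc --net --pid -- \
  /usr/share/elasticsearch/jdk/bin/jstack 1 > /tmp/jstack-es-data-8.txt
```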

Thank you very much. I have already restarted es-data-8 and the cluster's health has recovered to green, so a thread dump can't track down the issue now. But I think this problem will happen again; if it does, I will let you know.
Thank you once again!

Hi, another node has hit this problem (see screenshot).

I checked the jstack output and found: locked <0x000000128539b510>


@Armin_Braun

@ITzhangqiang can you provide the full jstack file here so I can take a look?

Thank you for your reply.

I ran jstack twice and found the same blocked thread both times.

I found that many nodes have this problem.


(Attachment jstack06_2 is missing)

(Attachment jstack06 is missing)

I received this notice and can't upload the files.


@ITzhangqiang

Can you upload the jstack files to e.g. https://pastebin.com and link them here? That should work fine :slight_smile:

I believe those refreshes are queued up by IndexingMemoryController. I will take a closer look.


Hi @nhat, @Armin_Braun,

I'm sorry to bother you, but the problem remains: on Elasticsearch 7.4 there is always some node whose index writer holds too much memory. I believe this is a big issue (ES 6.3 never showed it in a year of use). Please help me find the reason behind it! We must solve this problem!

@ITzhangqiang No worries, we are here to help. Can you share the shard-level stats of your cluster (GET /_stats?level=shards)?
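
If the full output is too large to post, a trimmed-down version along these lines would already help (the filter_path below is just one way to keep the fields most relevant here; localhost:9200 is a placeholder):

```sh
# Shard-level segment and translog stats, reduced to index writer memory,
# version map memory, and uncommitted translog size.
curl -s 'http://localhost:9200/_stats/segments,translog?level=shards&filter_path=**.index_writer_memory_in_bytes,**.version_map_memory_in_bytes,**.uncommitted_size_in_bytes&pretty'
```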

@ITzhangqiang I have merged https://github.com/elastic/elasticsearch/pull/50769, which should avoid flooding the refresh thread pool.

Thank you very much, @nhat.
I found that the abnormal index writer memory may be related to one particular index; see the index in the screenshot below.

This index makes the data-09 node's index writer memory abnormal; see the data-09 node in the screenshot below.

I checked the shards of this index and found one shard holding a large amount of index writer memory (see screenshot).

I have uploaded the data-09 jstack file and the detailed shard state of this abnormal index.

Hi @ITzhangqiang,

"uncommitted_size_in_bytes" : 12605908804

The uncommitted translog should not go above 512MB per shard by default. Did you change any translog setting? Can you share the logs from the node data-09?

One theory that I have is that the throttling does not work well. Can you add -Des.index.memory.max_index_buffer_size=256mb to config/jvm.options on some nodes and then restart them? Please let me know if the problem goes away on those nodes. Thank you.

Hi @nhat
For most indices, I changed the translog settings:
"translog" : {
  "sync_interval" : "60s",
  "durability" : "async",
  "flush_threshold_size": "1gb"
}
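
(For reference, these are regular index-level translog settings; as an example they could be set at index creation time roughly like this, where my-index and localhost:9200 are placeholders:)

```sh
# Create an index with the translog settings quoted above.
curl -s -X PUT -H 'Content-Type: application/json' 'http://localhost:9200/my-index' -d '
{
  "settings": {
    "index.translog.sync_interval": "60s",
    "index.translog.durability": "async",
    "index.translog.flush_threshold_size": "1gb"
  }
}'
```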
I think I have found the problem. I checked the abnormal index carefully and found that I manually specify the "document_id" field, but the indexed data is frequently bad, so millions of documents end up with the same "document_id" and keep being updated.
(As I can see in the jstack file, most of the write threads are waiting on the per-document-id lock:
at org.elasticsearch.index.engine.LiveVersionMap.acquireLock(LiveVersionMap.java:473)
at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:856))
It's this problematic index that leads to both issues: the rising index writer memory and the rising write queue.
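
(A rough way to confirm this pattern, with my-problem-index and localhost:9200 as placeholders: if indexing.index_total on the primaries is far larger than docs.count, most writes are overwriting existing document IDs instead of adding new documents:)

```sh
# Compare how many index operations the primaries have processed with how many
# documents actually exist; a huge gap means writes are mostly updates of the
# same _id values.
curl -s 'http://localhost:9200/my-problem-index/_stats/docs,indexing?filter_path=_all.primaries.docs.count,_all.primaries.indexing.index_total&pretty'
```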

After I fixed the problematic index, everything went back to normal.
Thank you very much for your help during this period.
Thanks again!


@ITzhangqiang Glad to hear that the problem was solved. You're welcome and thank you for collaborating.