Index repeatedly gets "stuck" with high index writer memory usage

Running ES 7.11.0 on Google Kubernetes Engine via ECK 1.4.0. Three-node cluster with each VM having 4 vCPUs, 16GB RAM, 3TB persistent disk.

I am loading data into several time-based indices. Most of them are fine, but one particular type has a recurring problem that has recently developed: in the early hours of the morning every day, a few hours after the daily index has been created, the index writer memory usage gets very high on only one node. After some time, the index then seems to get completely stuck and will not load more data. Other indices in the cluster are unaffected and continue to work normally.

A screenshot of Index Memory usage over the last few days is attached: you can see the spike (which is followed by a hang) on each day.

So far, the problem has always been fixed by restarting the cluster once I wake up, and then allowing the broken index to recover. Each time during recovery, the affected node seems to halt somewhere in the translog portion of the recovery but, eventually, it gets past it and everything goes back to normal until the problem reoccurs the next day, on a new index.

The index in question has a refresh interval of 600s, but dropping it to 60s didn't seem to make the problem go away. I'm also running with indices.memory.index_buffer_size = 2gb, but allowing it to drop to the default of 10% heap size also didn't seem to fix things.

Otherwise I think I have a pretty vanilla set-up, with sysctl -w vm.max_map_count=262144 on all nodes and a heap size -Xms8g -Xmx8g. Except: I am lying to Elasticsearch about the number of CPUs I have -- setting node.processors = 64 when I only actually have 4. The reason for this is that Google's standard persistent disk requires high I/O parallelism to get good performance, and the only way I could see to persuade Elasticsearch to run lots of I/O threads was to lie about the number of CPUs available. This mode of operation does significantly increase the maximum rate at which I can load data into the cluster, whilst still using the cheaper standard persistent disks.

Is the high number of threads likely to be causing this problem? If so, do I need to rework my cluster such that the indexing happens on "hot" nodes with SSDs, with the rollover API moving the closed indices to "warm" nodes before merging shards? If not, is there anything I can do to debug the stuck index problem?

Thanks in advance for any help or suggestions!

Welcome to our community! :smiley:

What do the hot threads from the relevant nodes show when this happens? And what do the logs show?
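For anyone following along, a minimal sketch of pulling the hot threads report from the nodes hot threads API (`GET /_nodes/hot_threads`) is shown here. The base URL, the node name, and the unauthenticated plain-HTTP access are assumptions for illustration; an ECK-managed cluster normally has TLS and authentication enabled, so a real call would need credentials and a CA certificate.

```python
import urllib.request
from urllib.parse import urlencode


def hot_threads_url(base_url, node_id=None, threads=5, interval="500ms"):
    """Build the URL for the nodes hot threads API.

    base_url and node_id are placeholders; pass node_id=None to query every node.
    """
    path = f"/_nodes/{node_id}/hot_threads" if node_id else "/_nodes/hot_threads"
    query = urlencode({"threads": threads, "interval": interval})
    return f"{base_url.rstrip('/')}{path}?{query}"


def fetch_hot_threads(base_url, node_id=None):
    """Return the plain-text report. Requires a reachable cluster (assumed: no auth)."""
    url = hot_threads_url(base_url, node_id)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")
```

Capturing the report a few times while the index is stuck, rather than after a restart, is likely to be more informative, since the threads of interest are the ones active during the hang.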

Thanks! I was pleased to find this place.

The issue has not reoccurred today. I checked the hot threads on the cluster anyway and nothing stood out. I did sometimes see Lucene Merge Threads using ~60% CPU, but I have the best_compression codec enabled so that seems reasonable.

Looking back through the logs, I think I may have found the culprit:

2021-02-25 06:25:17.119 GMT now throttling indexing for shard [[rep-20210225-poker][1]]: segment writing can't keep up
2021-02-25 09:35:52.521 GMT stop throttling indexing for shard [[rep-20210225-poker][1]]
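To see how long each shard spent throttled, the two log messages can be paired up and the gap between them measured. The sketch below assumes the log lines have exactly the layout quoted above (timestamp, `GMT`, message); the regular expression and the helper name `throttle_windows` are illustrative and would need adjusting for a different log format.

```python
import re
from datetime import datetime

# Matches the "now throttling" / "stop throttling" lines in the format quoted above.
LOG_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) GMT "
    r"(?P<action>now|stop) throttling indexing for shard "
    r"\[\[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\]\]"
)

SAMPLE = [
    "2021-02-25 06:25:17.119 GMT now throttling indexing for shard "
    "[[rep-20210225-poker][1]]: segment writing can't keep up",
    "2021-02-25 09:35:52.521 GMT stop throttling indexing for shard "
    "[[rep-20210225-poker][1]]",
]


def throttle_windows(lines):
    """Return ((index, shard), start, end) for each completed throttle period."""
    started = {}
    windows = []
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # not a throttling message
        key = (m["index"], int(m["shard"]))
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S.%f")
        if m["action"] == "now":
            started[key] = ts
        elif key in started:
            windows.append((key, started.pop(key), ts))
    return windows


for (index, shard), start, end in throttle_windows(SAMPLE):
    print(f"{index}[{shard}] throttled for {end - start}")
```

For the two lines quoted above, this gives a single throttle window of a little over three hours on shard 1.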

The index writer memory usage on that node during that time period looks like this:

So even after the throttle has apparently cleared at around 09:35, the index writer memory remains at a high level and does not appear to drop until I restart the node.

Is there any setting I could try tuning here, or do I just need more I/O capacity? Might more shards on this index help?

You could try more shards, but if it persists then it's an IO issue.
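Because `index.number_of_shards` is fixed when an index is created, trying more shards on a daily index means changing the template that the next day's index is created from. A minimal sketch of building such a request against the composable index template API (`PUT /_index_template/<name>`) follows. The template name, the `rep-*-poker` pattern (guessed from the index name in the logs), the shard count of 6, and the unauthenticated base URL are all assumptions for illustration. Note also that a composable template takes precedence over any legacy template matching the same indices, so any existing settings (refresh interval, codec, mappings) would need to be carried over into it.

```python
import json
import urllib.request


def shard_template_request(base_url, name, pattern, shards):
    """Build (but do not send) a PUT request that sets the shard count
    for future indices matching `pattern`. All arguments are placeholders."""
    body = {
        "index_patterns": [pattern],
        "template": {"settings": {"index.number_of_shards": shards}},
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/_index_template/{name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical values; sending it would be urllib.request.urlopen(req).
req = shard_template_request("http://localhost:9200", "rep-poker-shards", "rep-*-poker", 6)
```

Existing indices are not affected by the template; only indices created after it is in place pick up the new shard count.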
