Running ES 7.11.0 on Google Kubernetes Engine via ECK 1.4.0. Three-node cluster with each VM having 4 vCPUs, 16GB RAM, 3TB persistent disk.
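For reference, the relevant part of my ECK manifest looks roughly like this (a simplified sketch; the resource name and storage class name are placeholders, not my exact values):

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: logging                      # placeholder name
spec:
  version: 7.11.0
  nodeSets:
  - name: default
    count: 3
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            requests:
              cpu: 4
              memory: 16Gi
            limits:
              memory: 16Gi
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data     # ECK's expected data volume name
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: standard   # GCE standard persistent disk (placeholder class name)
        resources:
          requests:
            storage: 3Ti
```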
I am loading data into several time-based indices. Most of them are fine, but one particular type has recently developed a recurring problem: in the early hours of every morning, a few hours after the daily index has been created, index writer memory usage climbs very high on just one node. After some time the index seems to get completely stuck and will not load any more data. Other indices in the cluster are unaffected and continue to work normally.
A screenshot of Index Memory usage over the last few days is attached: you can see the spike (which is followed by a hang) on each day.
So far the fix has always been to restart the cluster once I wake up and let the broken index recover. Each time, the affected node seems to stall somewhere in the translog stage of recovery, but it eventually gets past it and everything goes back to normal until the problem reoccurs the next day, on a new index.
The index in question has a refresh interval of 600s, but dropping it to 60s didn't seem to make the problem go away. I'm also running with indices.memory.index_buffer_size: 2gb, but letting it fall back to the default of 10% of the heap didn't fix things either.
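To be explicit about where those two settings live, a minimal sketch of the relevant config:

```yaml
# In each nodeSet's `config:` block (elasticsearch.yml); this is a static node
# setting, so changing it needs a rolling restart:
indices.memory.index_buffer_size: 2gb

# index.refresh_interval (600s, then 60s) is a dynamic per-index setting, so I
# change it on the daily index / its template rather than in elasticsearch.yml.
```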
Otherwise I think I have a pretty vanilla set-up, with sysctl -w vm.max_map_count=262144 applied on all nodes and a heap size of -Xms8g -Xmx8g. The one exception: I am lying to Elasticsearch about the number of CPUs it has, setting node.processors: 64 when each VM only actually has 4. The reason is that Google's standard persistent disk needs high I/O parallelism to get good performance, and the only way I could see to persuade Elasticsearch to run lots of I/O threads was to overstate the CPU count. This does significantly increase the maximum rate at which I can load data into the cluster whilst still using the cheaper standard persistent disks.
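Concretely, those node-level bits are applied through the nodeSet config and pod template; a simplified fragment of the same nodeSet as in the manifest above (layout is a sketch, not my exact manifest):

```yaml
# spec.nodeSets[0]
- name: default
  count: 3
  config:
    node.processors: 64                    # deliberately higher than the 4 real vCPUs
    indices.memory.index_buffer_size: 2gb
  podTemplate:
    spec:
      initContainers:
      - name: sysctl                       # sets vm.max_map_count on the host
        securityContext:
          privileged: true
        command: ['sh', '-c', 'sysctl -w vm.max_map_count=262144']
      containers:
      - name: elasticsearch
        env:
        - name: ES_JAVA_OPTS               # 8 GB heap out of 16 GB RAM
          value: "-Xms8g -Xmx8g"
```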
Is the high number of threads likely to be causing this problem? If so, do I need to rework my cluster such that the indexing happens on "hot" nodes with SSDs, with the rollover API moving the closed indices to "warm" nodes before merging shards? If not, is there anything I can do to debug the stuck index problem?
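For reference, the rework I have in mind would be roughly two nodeSets split by data tier, with ILM rollover keeping indexing on the hot tier and moving older indices to warm (node set names, counts, sizes and storage class names below are placeholders):

```yaml
nodeSets:
- name: hot
  count: 3
  config:
    node.roles: ["master", "data_hot", "data_content", "ingest"]
  volumeClaimTemplates:
  - metadata:
      name: elasticsearch-data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: premium-rwo   # SSD persistent disk (placeholder class name)
      resources:
        requests:
          storage: 1Ti
- name: warm
  count: 2
  config:
    node.roles: ["data_warm"]
  volumeClaimTemplates:
  - metadata:
      name: elasticsearch-data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: standard      # cheaper standard persistent disk
      resources:
        requests:
          storage: 3Ti
```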
Thanks in advance for any help or suggestions!