GC can't decrease heap memory usage and Elastic fails #44312

I use Elasticsearch 6.2.4 with a 2-node cluster.
When the indexing rate increases at a specific time of day, heap memory usage exceeds 70% and GC starts, but heap usage does not decrease; it keeps climbing to 97-98%, Elasticsearch stops responding, and it fails after a few minutes.
1- Both servers have 64 GB of memory, and both Elasticsearch nodes have 32 GB of heap allocated.
2- The failure happens every day at a specific time (which is odd, because it is not the time of maximum load).
3- The load at failure time is 70-80 indexed documents per second and about 50-60 searches per second.
4- It is always the master node that fails.
5- If needed, I can post the jvm.options and elasticsearch.yml files and the logs.

Heap should generally be somewhere below 30GB. 32GB is too large, as you will not benefit from compressed ordinary object pointers (compressed oops).
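For reference, a minimal `jvm.options` sketch (the path and exact value are illustrative; the point is that `-Xms`/`-Xmx` match and stay comfortably below the ~31GB compressed-oops cutoff):

```
# /etc/elasticsearch/jvm.options -- path varies by install method
# Keep min and max heap equal and below the compressed-oops cutoff (~31 GB)
-Xms26g
-Xmx26g
```

If I remember right, on 6.x you can confirm it took effect by looking for the `heap size [...], compressed ordinary object pointers [true]` line Elasticsearch logs at startup.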

What is the full output of the cluster stats API around the time when heap usage approaches 70%?

Do you have any non-standard configuration in place?

How are you indexing into the cluster? What type of queries are you running? What kind of hardware is the cluster deployed on?

What is the output of:

GET /
GET /_cat/nodes?v
GET /_cat/health?v
GET /_cat/indices?v
# Optionally 
GET /_cat/shards?v

If some outputs are too big, please share them on gist.github.com and link them here.

[14/07/2019 11:21 AM] Arman Ajdani: [2019-07-14T06:49:32,420][INFO ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] [gc][85570] overhead, spent [429ms] collecting in the last [1s]
[2019-07-14T06:49:34,422][INFO ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] [gc][85572] overhead, spent [304ms] collecting in the last [1s]
[2019-07-14T06:49:52,546][INFO ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] [gc][85590] overhead, spent [482ms] collecting in the last [1.1s]
[2019-07-14T06:49:53,742][INFO ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] [gc][85591] overhead, spent [477ms] collecting in the last [1.1s]
[2019-07-14T06:50:12,747][INFO ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] [gc][85610] overhead, spent [432ms] collecting in the last [1s]
[2019-07-14T06:50:13,900][INFO ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] [gc][85611] overhead, spent [435ms] collecting in the last [1.1s]
[2019-07-14T06:50:14,900][INFO ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] [gc][85612] overhead, spent [280ms] collecting in the last [1s]
[2019-07-14T06:50:32,032][INFO ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] [gc][85629] overhead, spent [516ms] collecting in the last [1.1s]
[2019-07-14T06:50:33,206][INFO ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] [gc][85630] overhead, spent [447ms] collecting in the last [1.1s]
[2019-07-14T06:50:34,206][INFO ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] [gc][85631] overhead, spent [371ms] collecting in the last [1s]
[2019-07-14T06:50:53,209][INFO ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] [gc][85650] overhead, spent [324ms] collecting in the last [1s]
[2019-07-14T06:50:54,210][INFO ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] [gc][85651] overhead, spent [417ms] collecting in the last [1s]
[2019-07-14T06:50:55,210][INFO ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] [gc][85652] overhead, spent [309ms] collecting in the last [1s]

[14/07/2019 11:23 AM] Arman Ajdani: [2019-07-14T06:52:41,986][WARN ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] [gc][old][85731][302] duration [27.3s], collections [2]/[28.3s], total [27.3s]/[45.4s], memory [22.8gb]->[19.2gb]/[23.9gb], all_pools {[young] [45mb]->[122.5mb]/[665.6mb]}{[survivor] [83.1mb]->[0b]/[83.1mb]}{[old] [22.6gb]->[19.1gb]/[23.1gb]}

This is the log from when the failure happens:
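As a side note, you can track how much each old-gen collection actually frees by parsing the `JvmGcMonitorService` lines above. A rough sketch (the regex only targets the `memory [X]->[Y]/[Z]` field; field names and units are as they appear in your log):

```python
import re

# Matches the "memory [22.8gb]->[19.2gb]/[23.9gb]" field in
# JvmGcMonitorService old-GC log lines.
MEM_RE = re.compile(
    r"memory \[([\d.]+)(gb|mb)\]->\[([\d.]+)(gb|mb)\]/\[([\d.]+)(gb|mb)\]"
)

def gb(value: str, unit: str) -> float:
    """Normalize a logged size to gigabytes."""
    return float(value) if unit == "gb" else float(value) / 1024

def heap_before_after(line: str):
    """Return (before_gb, after_gb, max_gb) for one GC log line, or None."""
    m = MEM_RE.search(line)
    if not m:
        return None
    return (gb(m.group(1), m.group(2)),
            gb(m.group(3), m.group(4)),
            gb(m.group(5), m.group(6)))

line = ("[2019-07-14T06:52:41,986][WARN ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] "
        "[gc][old][85731][302] duration [27.3s], collections [2]/[28.3s], "
        "memory [22.8gb]->[19.2gb]/[23.9gb]")
before, after, limit = heap_before_after(line)
print(f"old GC freed {before - after:.1f} GB; "
      f"heap still at {after / limit:.0%} of {limit} GB")
# → old GC freed 3.6 GB; heap still at 80% of 23.9 GB
```

A 27-second old-gen pause that still leaves the heap ~80% full is a classic sign that the live set itself is too big, not just that garbage is accumulating.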

ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
1 40 99 22 5.31 3.79 3.77 mdi * 3aY8vv0
2 16 99 4 3.17 3.21 3.27 mdi - fSX3XcC

The shards and indices outputs are too long, because I have about 8000 shards on these two nodes.

You probably have too many shards per node.

May I suggest you look at the following resources about sizing:

https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing

And https://www.elastic.co/webinars/using-rally-to-get-your-elasticsearch-cluster-size-right

Yeah, but a lot of the shards are empty, and a lot of them are for Metricbeat and Logstash.

8000 shards on 2 nodes means that you might need something like 200GB of heap to manage them.

Shards take up heap space even if they are almost empty, so the count does matter, not least because it increases the size of the cluster state. This old blog post has an interesting, although somewhat extreme example. :slight_smile:
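To put rough numbers on it: a commonly cited rule of thumb from Elastic's sizing guidance (an upper bound, not a target; the real per-shard overhead depends on mappings and segment counts) is to aim for at most ~20 shards per GB of heap. A quick back-of-the-envelope check:

```python
def min_heap_gb(shards_per_node: int, shards_per_gb: int = 20) -> float:
    """Heap a node would need under the ~20-shards-per-GB rule of thumb."""
    return shards_per_node / shards_per_gb

# 8000 shards across 2 nodes -> roughly 4000 per node
print(min_heap_gb(4000))   # → 200.0 (GB) -- far beyond the ~30 GB ceiling
# After deleting indices: ~3000 shards per node
print(min_heap_gb(3000))   # → 150.0 (GB) -- still far too much
```

That lines up with the 200GB figure above, and shows why even 3000 shards per node is far outside what a sub-30GB heap can comfortably manage.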

I have deleted some indices and now have about 3000 shards on each server. Is this okay, or is it still too many shards per node?

That still sounds like a lot.