GC can't decrease heap memory usage and Elastic fails #44312

I use Elasticsearch 6.2.4 with a 2-node cluster.
When the indexing rate increases at a specific time of day, heap memory usage exceeds 70% and GC starts, but heap usage does not decrease; it keeps climbing to 97-98%, Elasticsearch stops responding, and it fails after a few minutes.
1- Both servers have 64 GB of memory, and both Elasticsearch nodes have 32 GB of heap allocated.
2- The failure happens every day at a specific time (which is odd, because it is not the time of maximum load).
3- The load at failure time is 70-80 indexed documents per second and about 50-60 searches per second.
4- It is always the master node that fails.
5- If needed, I can post the jvm.options and elasticsearch.yml files and the logs.

Heap should generally be somewhere below 30GB. 32GB is too large, as you will not benefit from compressed ordinary object pointers (compressed oops).
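For reference, a minimal `jvm.options` sketch (the path and exact value are illustrative; the point is that `-Xms`/`-Xmx` match and stay comfortably below the ~31GB compressed-oops cutoff):

```
# /etc/elasticsearch/jvm.options -- path varies by install method
# Keep min and max heap equal and below the compressed-oops cutoff (~31 GB)
-Xms26g
-Xmx26g
```

If I remember right, on 6.x you can confirm it took effect by looking for the `heap size [...], compressed ordinary object pointers [true]` line Elasticsearch logs at startup.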

What is the full output of the cluster stats API around the time when heap usage approaches 70%?

Do you have any non-standard configuration in place?

How are you indexing into the cluster? What type of queries are you running? What kind of hardware is the cluster deployed on?

What is the output of:

GET /
GET /_cat/nodes?v
GET /_cat/health?v
GET /_cat/indices?v
# Optionally 
GET /_cat/shards?v

If some outputs are too big, please share them on gist.github.com and link them here.

[14/07/2019 11:21 AM] Arman Ajdani: [2019-07-14T06:49:32,420][INFO ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] [gc][85570] overhead, spent [429ms] collecting in the last [1s]
[2019-07-14T06:49:34,422][INFO ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] [gc][85572] overhead, spent [304ms] collecting in the last [1s]
[2019-07-14T06:49:52,546][INFO ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] [gc][85590] overhead, spent [482ms] collecting in the last [1.1s]
[2019-07-14T06:49:53,742][INFO ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] [gc][85591] overhead, spent [477ms] collecting in the last [1.1s]
[2019-07-14T06:50:12,747][INFO ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] [gc][85610] overhead, spent [432ms] collecting in the last [1s]
[2019-07-14T06:50:13,900][INFO ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] [gc][85611] overhead, spent [435ms] collecting in the last [1.1s]
[2019-07-14T06:50:14,900][INFO ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] [gc][85612] overhead, spent [280ms] collecting in the last [1s]
[2019-07-14T06:50:32,032][INFO ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] [gc][85629] overhead, spent [516ms] collecting in the last [1.1s]
[2019-07-14T06:50:33,206][INFO ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] [gc][85630] overhead, spent [447ms] collecting in the last [1.1s]
[2019-07-14T06:50:34,206][INFO ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] [gc][85631] overhead, spent [371ms] collecting in the last [1s]
[2019-07-14T06:50:53,209][INFO ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] [gc][85650] overhead, spent [324ms] collecting in the last [1s]
[2019-07-14T06:50:54,210][INFO ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] [gc][85651] overhead, spent [417ms] collecting in the last [1s]
[2019-07-14T06:50:55,210][INFO ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] [gc][85652] overhead, spent [309ms] collecting in the last [1s]

[14/07/2019 11:23 AM] Arman Ajdani: [2019-07-14T06:52:41,986][WARN ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] [gc][old][85731][302] duration [27.3s], collections [2]/[28.3s], total [27.3s]/[45.4s], memory [22.8gb]->[19.2gb]/[23.9gb], all_pools {[young] [45mb]->[122.5mb]/[665.6mb]}{[survivor] [83.1mb]->[0b]/[83.1mb]}{[old] [22.6gb]->[19.1gb]/[23.1gb]}

This is the log from when the failure happens:
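As a side note, you can track how much each old-gen collection actually frees by parsing the `JvmGcMonitorService` lines above. A rough sketch (the regex only targets the `memory [X]->[Y]/[Z]` field; field names and units are as they appear in your log):

```python
import re

# Matches the "memory [22.8gb]->[19.2gb]/[23.9gb]" field in
# JvmGcMonitorService old-GC log lines.
MEM_RE = re.compile(
    r"memory \[([\d.]+)(gb|mb)\]->\[([\d.]+)(gb|mb)\]/\[([\d.]+)(gb|mb)\]"
)

def gb(value: str, unit: str) -> float:
    """Normalize a logged size to gigabytes."""
    return float(value) if unit == "gb" else float(value) / 1024

def heap_before_after(line: str):
    """Return (before_gb, after_gb, max_gb) for one GC log line, or None."""
    m = MEM_RE.search(line)
    if not m:
        return None
    return (gb(m.group(1), m.group(2)),
            gb(m.group(3), m.group(4)),
            gb(m.group(5), m.group(6)))

line = ("[2019-07-14T06:52:41,986][WARN ][o.e.m.j.JvmGcMonitorService] [3aY8vv0] "
        "[gc][old][85731][302] duration [27.3s], collections [2]/[28.3s], "
        "memory [22.8gb]->[19.2gb]/[23.9gb]")
before, after, limit = heap_before_after(line)
print(f"old GC freed {before - after:.1f} GB; "
      f"heap still at {after / limit:.0%} of {limit} GB")
# → old GC freed 3.6 GB; heap still at 80% of 23.9 GB
```

A 27-second old-gen pause that still leaves the heap ~80% full is a classic sign that the live set itself is too big, not just that garbage is accumulating.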

ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
1 40 99 22 5.31 3.79 3.77 mdi * 3aY8vv0
2 16 99 4 3.17 3.21 3.27 mdi - fSX3XcC

The shards and indices outputs are too long, because I have about 8000 shards on these two nodes.

You probably have too many shards per node.

May I suggest you look at the following resources about sizing:

https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing

And https://www.elastic.co/webinars/using-rally-to-get-your-elasticsearch-cluster-size-right

Yeah, but a lot of the shards are empty, and a lot of them are for Metricbeat and Logstash.

8000 shards on 2 nodes means that you might need something like 200GB of heap to manage them.

Shards take up heap space even if they are almost empty, so the count does matter, not least because it increases the size of the cluster state. This old blog post has an interesting, although somewhat extreme example. :slight_smile:
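To put rough numbers on it: a commonly cited rule of thumb from Elastic's sizing guidance (an upper bound, not a target; the real per-shard overhead depends on mappings and segment counts) is to aim for at most ~20 shards per GB of heap. A quick back-of-the-envelope check:

```python
def min_heap_gb(shards_per_node: int, shards_per_gb: int = 20) -> float:
    """Heap a node would need under the ~20-shards-per-GB rule of thumb."""
    return shards_per_node / shards_per_gb

# 8000 shards across 2 nodes -> roughly 4000 per node
print(min_heap_gb(4000))   # → 200.0 (GB) -- far beyond the ~30 GB ceiling
# After deleting indices: ~3000 shards per node
print(min_heap_gb(3000))   # → 150.0 (GB) -- still far too much
```

That lines up with the 200GB figure above, and shows why even 3000 shards per node is far outside what a sub-30GB heap can comfortably manage.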

I have deleted some indices and now have about 3000 shards on each server. Is this okay, or is it still too many shards per node?

That still sounds like a lot.