How to right-size the heap, long GC, ES 2.4

Hi There

I have recently deployed 2 separate ES 2.4 clusters, 2 nodes in each cluster, running on Windows Server 2012 R2 Standard as virtual machines on VMware, with 12 cores, 24 GB RAM and 1 TB disk each. I am using Java 1.8, update 102.

The clusters are index-heavy, receiving around 300-400 million entries per day. Logs are sent from Filebeat to Logstash, which forwards them to ES. The indexing rate is around 3,000-7,000 documents per second.

I have reduced the shard count from 5 to 2 for my index, and I am creating a new index every day.
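For anyone wondering how a daily index ends up with 2 shards: this is normally done with an index template. A minimal sketch (the template name and index pattern are placeholders, not necessarily my exact names):

import json
import urllib.request

# Hypothetical template: any index matching dnslogs-* gets 2 primary shards.
template = {
    "template": "dnslogs-*",
    "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 1
    }
}

req = urllib.request.Request(
    "http://localhost:9200/_template/dnslogs_2_shards",
    data=json.dumps(template).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())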

My main problem: what is the correct approach to determining the right ES heap size?

I started out with a heap size of around 8 GB. That let the cluster run for approximately one day before it hit a long old-generation GC, which took several minutes, and I then got error messages that the master had left.

I am now at a heap size of 16 GB, which lets it run for a few days before again being hit by a long old-generation GC. Last night I saw an old GC taking a whopping 42 minutes. I also see heap usage slowly increasing over time, like some kind of memory leak.

So I tried making a script that monitors the heap and gracefully restarts the node when usage reaches 90% or more, but I don't think this is the correct approach.
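Something like this rough sketch (the host is a placeholder and the actual restart step is left out):

import json
import urllib.request

NODE = "http://localhost:9200"   # placeholder: one node's HTTP endpoint
THRESHOLD = 90                   # restart when heap usage reaches this percentage

# JVM stats for the local node only.
with urllib.request.urlopen(NODE + "/_nodes/_local/stats/jvm") as resp:
    stats = json.load(resp)

for node in stats["nodes"].values():
    used_pct = node["jvm"]["mem"]["heap_used_percent"]
    print(node["name"], "heap used:", used_pct, "%")
    if used_pct >= THRESHOLD:
        # The real script would gracefully restart the Windows service here.
        print("heap above threshold, node", node["name"], "should be restarted")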

I would like to think that ES is mature enough not to suffer from this problem.

So my question: what is the correct way of sizing the heap to avoid long GCs? What other settings should I be concerned about in an index-heavy cluster? I have done a lot of googling, of course, but I need a little help :wink:

We would like to add more logs and more nodes to the clusters, but I need them to be more stable before that happens.

Regards


How much data is in the cluster, how many indices, how many shards?
What does your mapping look like? Do you have lots of fielddata taking up heap?

Hi Mark

I have 2 daily indices: one for my DNS logs and one for Winlogbeat, i.e. Windows event logs.

The DNS index is around 300 million documents per day, 100 GB, 2 shards, 1 replica; the Winlogbeat index is around 25 million documents, 40 GB, 5 shards, 1 replica.

I prune all indexes older than 8 days.

If I monitor with Marvel, the fielddata graph is consistently at 0.
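A quick way to double-check that outside Marvel is the node stats API; a minimal sketch (host is a placeholder):

import json
import urllib.request

# Fielddata memory per node; it should stay near 0 if nothing loads fielddata.
url = "http://localhost:9200/_nodes/stats/indices/fielddata"
with urllib.request.urlopen(url) as resp:
    stats = json.load(resp)

for node in stats["nodes"].values():
    fd = node["indices"]["fielddata"]
    print(node["name"], "fielddata bytes:", fd["memory_size_in_bytes"],
          "evictions:", fd["evictions"])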

Regards

What sort of queries are you doing against it?

Hi

Just very simple queries through Kibana 4.6. Nothing fancy.

Things like who queried a given DNS domain, or which computers have a given event_id. Very basic stuff.

And very few queries throughout the day. All queries respond quickly and smoothly.

Regards

What does your node configuration look like?

Hi

What exactly would you like to know?

Regards

Anything that is not default.

Anything that is not default ...

Well, I have changed Logstash parameters to increase indexing performance:

-w 24 -b 3048 (workers = 24, batch size = 3048)

I have also increased the output parameters in logstash.json:

workers = 2, flush_size = 3048

These are the non-default settings from elasticsearch.yml:

# Indexing settings for writes
index.refresh_interval: 30s

path.data: d:\data
path.logs: c:\logs

cluster.routing.allocation.cluster_concurrent_rebalance: 4
cluster.routing.allocation.node_initial_primaries_recoveries: 4
cluster.routing.allocation.node_concurrent_recoveries: 6
indices.recovery.max_bytes_per_sec: 125mb
indices.recovery.concurrent_streams: 5
indices.memory.index_buffer_size: 30%
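For completeness, the settings each node actually loaded can be checked with the nodes info API; a small sketch (host is a placeholder):

import json
import urllib.request

# The "settings" metric of the nodes info API echoes back what each node
# actually loaded, which helps spot a node that missed a config change.
with urllib.request.urlopen("http://localhost:9200/_nodes/settings") as resp:
    info = json.load(resp)

for node in info["nodes"].values():
    print(node["name"], json.dumps(node["settings"], indent=2))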

Hi

We had another long GC, which caused the cluster to fail.

[2016-09-19 10:24:52,142][WARN ][monitor.jvm ] [node1] [gc][young][111431][3834] duration [2.9s], collections [1]/[3.5s], total [2.9s]/[3.2m], memory [13.7gb]->[9.3gb]/[19.3gb], all_pools {[young] [4.9gb]->[108.1mb]/[5.3gb]}{[survivor] [606.8mb]->[682.6mb]/[682.6mb]}{[old] [8.1gb]->[8.6gb]/[13.3gb]}
[2016-09-19 10:25:04,680][INFO ][monitor.jvm ] [node1] [gc][young][111443][3835] duration [908ms], collections [1]/[1.2s], total [908ms]/[3.2m], memory [14.5gb]->[9.7gb]/[19.3gb], all_pools {[young] [5.2gb]->[113mb]/[5.3gb]}{[survivor] [682.6mb]->[682.6mb]/[682.6mb]}{[old] [8.6gb]->[8.9gb]/[13.3gb]}
[2016-09-19 10:25:21,951][WARN ][monitor.jvm ] [node1] [gc][young][111457][3836] duration [2.9s], collections [1]/[4s], total [2.9s]/[3.3m], memory [14.2gb]->[10.5gb]/[19.3gb], all_pools {[young] [4.5gb]->[87.1mb]/[5.3gb]}{[survivor] [682.6mb]->[682.6mb]/[682.6mb]}{[old] [8.9gb]->[9.8gb]/[13.3gb]}
[2016-09-19 10:25:35,434][WARN ][monitor.jvm ] [node1] [gc][young][111469][3837] duration [1.8s], collections [1]/[2.3s], total [1.8s]/[3.3m], memory [15.5gb]->[11gb]/[19.3gb], all_pools {[young] [5gb]->[41.2mb]/[5.3gb]}{[survivor] [682.6mb]->[682.6mb]/[682.6mb]}{[old] [9.8gb]->[10.3gb]/[13.3gb]}
[2016-09-19 10:25:49,194][WARN ][monitor.jvm ] [node1] [gc][young][111481][3838] duration [2.3s], collections [1]/[2.5s], total [2.3s]/[3.3m], memory [16.2gb]->[12gb]/[19.3gb], all_pools {[young] [5.2gb]->[34.1mb]/[5.3gb]}{[survivor] [682.6mb]->[682.6mb]/[682.6mb]}{[old] [10.3gb]->[11.3gb]/[13.3gb]}
[2016-09-19 10:26:01,727][WARN ][monitor.jvm ] [node1] [gc][young][111491][3839] duration [2.2s], collections [1]/[3.1s], total [2.2s]/[3.4m], memory [16.4gb]->[12.6gb]/[19.3gb], all_pools {[young] [4.4gb]->[35mb]/[5.3gb]}{[survivor] [682.6mb]->[682.6mb]/[682.6mb]}{[old] [11.3gb]->[11.9gb]/[13.3gb]}
[2016-09-19 10:26:13,914][INFO ][monitor.jvm ] [node1] [gc][young][111503][3840] duration [762ms], collections [1]/[1s], total [762ms]/[3.4m], memory [17.8gb]->[12.9gb]/[19.3gb], all_pools {[young] [5.1gb]->[33.4mb]/[5.3gb]}{[survivor] [682.6mb]->[682.6mb]/[682.6mb]}{[old] [11.9gb]->[12.2gb]/[13.3gb]}
[2016-09-19 10:26:27,011][WARN ][monitor.jvm ] [node1] [gc][young][111515][3841] duration [1.1s], collections [1]/[1.9s], total [1.1s]/[3.4m], memory [17.7gb]->[13.2gb]/[19.3gb], all_pools {[young] [4.7gb]->[2.8mb]/[5.3gb]}{[survivor] [682.6mb]->[655.2mb]/[682.6mb]}{[old] [12.2gb]->[12.5gb]/[13.3gb]}
[2016-09-19 10:32:45,761][WARN ][monitor.jvm ] [node1] [gc][old][111536][3] duration [5.9m], collections [1]/[5.9m], total [5.9m]/[5.9m], memory [18.5gb]->[5gb]/[19.3gb], all_pools {[young] [5.2gb]->[108.8mb]/[5.3gb]}{[survivor] [655.2mb]->[0b]/[682.6mb]}{[old] [12.5gb]->[4.9gb]/[13.3gb]}

This time it took 5.9 minutes.

Regards

That is indeed a very long GC. Do you perhaps have swap enabled or is the host overcommitted with respect to memory?

Hi

The same behaviour occurs on 2 separate ES clusters.

The first cluster has 24 GB RAM on each node and the second cluster has 32 GB RAM.

I have changed the configuration today.

  1. bootstrap.mlockall: true
  2. Memory reservation on VMWare

So I need to wait to see the effect of these settings, if any...
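To verify that the mlockall change took effect, I will check the nodes info API, with something like this sketch (host is a placeholder):

import json
import urllib.request

# Each node reports whether it managed to lock its memory (bootstrap.mlockall).
with urllib.request.urlopen("http://localhost:9200/_nodes/process") as resp:
    info = json.load(resp)

for node in info["nodes"].values():
    print(node["name"], "mlockall:", node["process"]["mlockall"])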

I can provide a JVM heap dump, if anyone wants to take a look.

Regards

Having memory ballooning (on the VMware side) is bad.


Well, I found the root cause of this problem.

It turned out to be the garbage collector used by ES 2.4.0 on Windows: the JVM was running with the serial collector. I installed the new 2.4.1 update of ES on my clusters and finally I see the expected sawtooth pattern in heap usage: it climbs to around 75% and then drops, as expected.
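For anyone hitting the same thing: one way to spot it is to check which collectors the JVM actually reports, for example via the nodes info API. A minimal sketch (host is a placeholder):

import json
import urllib.request

# With the default CMS setup the JVM should report ParNew + ConcurrentMarkSweep;
# "Copy" / "MarkSweepCompact" would mean it fell back to the serial collector.
with urllib.request.urlopen("http://localhost:9200/_nodes/jvm") as resp:
    info = json.load(resp)

for node in info["nodes"].values():
    print(node["name"], "gc collectors:", node["jvm"]["gc_collectors"])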

I have to admit that this bug is so bad and so embarrassing for the quality of ES on Windows that I am very surprised it could go unnoticed for so long.

I would like some comment from ES on how such a fatal bug was able to pass your quality tests. Don't you have any Windows clusters running ES? Why is ES able to run at all with the serial garbage collector, without a giant warning in the logs? Better yet, it should refuse to run at all with the serial garbage collector enabled.

Regards

Why is ES able to run at all with the serial garbage collector, without a giant warning in the logs? Better yet, it should refuse to run at all with the serial garbage collector enabled.

We agree. See "Add serial collector bootstrap check" by jasontedor, elastic/elasticsearch PR #20558 on GitHub.

I would like some comment from ES on how such a fatal bug was able to pass your quality tests.

Oversight. If you configure a particular garbage collector in your application, do you have a check to ensure that the JVM is actually using that collector? Probably not. It turns out we should. Live and learn.