Long running GC, cluster status RED, only few GB's data

Astro-es · February 25, 2016, 6:59am

It has been long since we're facing ES cluster down issues regularly generally when Garbage collection initiates, although data size is relatively small and enough memory is available.

ES Setup:
ES version: 2.0.0
Cluster size:2 nodes
Memory:32 GB

Config:
ES Heap Size: 50% of RAM
index.number_of_shards: 2
index.number_of_replicas: 1
indices.fielddata.cache.size: 20%
bootstrap.mlockall: true
Heap % used is not consistently > 75%.

Data:
Total no of open indices: 11
Data size 6 GB

[2016-02-08 16:34:22,033][WARN ][monitor.jvm ] [elastic-node-1] [gc][old][1913][491] duration [35.7s], collections [1]/[35.8s], total [35.7s]/[3.1h], memory [15.9gb]->[15.9gb]/[15.9gb], all_pools {[young] [266.2mb]->[266.2mb]/[266.2mb]}{[survivor] [221.7kb]->[5.2mb]/[33.2mb]}{[old] [15.6gb]->[15.6gb]/[15.6gb]}
[2016-02-08 16:34:46,383][WARN ][monitor.jvm ] [elastic-node-1] [gc][old][1914][492] duration [24.2s], collections [1]/[24.3s], total [24.2s]/[3.1h], memory [15.9gb]->[15.9gb]/[15.9gb], all_pools {[young] [266.2mb]->[242.4mb]/[266.2mb]}{[survivor] [5.2mb]->[0b]/[33.2mb]}{[old] [15.6gb]->[15.6gb]/[15.6gb]}
[2016-02-08 16:35:20,045][WARN ][monitor.jvm ] [elastic-node-1] [gc][old][1915][493] duration [33.4s], collections [1]/[33.6s], total [33.4s]/[3.1h], memory [15.9gb]->[15.9gb]/[15.9gb], all_pools {[young] [242.4mb]->[266.2mb]/[266.2mb]}{[survivor] [0b]->[268.9kb]/[33.2mb]}{[old] [15.6gb]->[15.6gb]/[15.6gb]}
[2016-02-08 16:36:16,011][WARN ][monitor.jvm ] [elastic-node-1] [gc][old][1917][495] duration [33.8s], collections [1]/[33.9s], total [33.8s]/[3.1h], memory [15.9gb]->[15.9gb]/[15.9gb], all_pools {[young] [266.2mb]->[266.2mb]/[266.2mb]}{[survivor] [17.1mb]->[26.5mb]/[33.2mb]}{[old] [15.6gb]->[15.6gb]/[15.6gb]}
[2016-02-08 16:36:36,249][WARN ][monitor.jvm ] [elastic-node-1] [gc][old][1918][496] duration [20.2s], collections [1]/[20.2s], total [20.2s]/[3.1h], memory [15.9gb]->[15.9gb]/[15.9gb], all_pools {[young] [266.2mb]->[266.2mb]/[266.2mb]}{[survivor] [26.5mb]->[29.6mb]/[33.2mb]}{[old] [15.6gb]->[15.6gb]/[15.6gb]}
[2016-02-08 16:37:08,229][WARN ][monitor.jvm]

We regularly close index (one per day) to save on memory. Keeping only 7 days old data where each index is 500-800MB in size.

Questions:

Why do these long-running GC start although there is enough free memory available. Considering 6GB data size?
Is there any additional configurations required to avoid this stop the world situation?

Any help will be appreciated.

nik9000 · February 25, 2016, 3:38pm

Its really hard to tell what is causing this without knowing things like what version of elasticsearch you are using. This blog post talks about it though it is a little old. In 2.x doc values are the default. You should be able to add doc values to the template that you use to build your next day's index. That will make the index take up more space on disk but less space in heap memory.

warkolm · February 25, 2016, 10:21pm

What sort of document structure is this?
What sort of queries are you doing?

Astro-es · February 26, 2016, 1:33am

We store logs from application server and nginx. We do text based searching through Kibana.

There are some scheduled aggregation jobs run every 24 hours. Also we don't query it as frequently.

Astro-es · March 1, 2016, 1:32am

@warkolm
Any suggestions on this?

msimos · March 1, 2016, 1:57am

When you start to see this problem can you run the following command. This is all one line.

curl  'http://localhost:9200/_cat/nodes?v&h=name,heap.max,heap.current,heap.percent,fielddata.memory_size,completion.size,filter_cache.memory_size,percolate.memory_size,segments.memory,segments.index_writer_memory,segments.version_map_memory'

You can also run this periodically to get an idea of your memory usage.

Astro-es · March 3, 2016, 6:49am

Thanks @msimos,

I have collected output of the above command when the cluster went down:
Node stats:
curl 'http://localhost:9200/_cat/nodes?v&h=name,heap.max,heap.current,heap.percent,fielddata.memory_size,completion.size,filter_cache.memory_size,percolate.memory_size,segments.memory,segments.index_writer_memory,segments.version_map_memory'
name heap.max heap.current heap.percent fielddata.memory_size completion.size percolate.memory_size segments.memory segments.index_writer_memory segments.version_map_memory
prod-elastic-node-1 15.9gb 15.9gb 99 187.3mb 0b -1b 34.5mb 0b 0b

The GC traces:

2016-03-03 06:22:50,372][WARN ][monitor.jvm ] [elastic-node-1] [gc][old][284][2] duration [26.6s], collections [1]/[26.7s], total [26.6s]/[29.6s], memory [15.7gb]->[15gb]/[15.9gb], all_pools {[young] [128.7mb]->[2.2mb]/[266.2mb]}{[survivor] [33.2mb]->[0b]/[33.2mb]}{[old] [15.6gb]->[15gb]/[15.6gb]}
[2016-03-03 06:23:26,084][WARN ][monitor.jvm ] [elastic-node-1] [gc][old][287][3] duration [32.5s], collections [1]/[33.5s], total [32.5s]/[1m], memory [15.4gb]->[15.7gb]/[15.9gb], all_pools {[young] [10.8mb]->[61.2mb]/[266.2mb]}{[survivor] [33.2mb]->[0b]/[33.2mb]}{[old] [15.4gb]->[15.6gb]/[15.6gb]}
[2016-03-03 06:23:59,447][WARN ][monitor.jvm ] [elastic-node-1] [gc][old][288][4] duration [33s], collections [1]/[33.3s], total [33s]/[1.5m], memory [15.7gb]->[15.7gb]/[15.9gb], all_pools {[young] [61.2mb]->[74.3mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [15.6gb]->[15.6gb]/[15.6gb]}

free -m:

         total       used       free     shared    buffers     cached

Mem: 30679 28537 2142 0 287 8634
-/+ buffers/cache: 19616 11063
Swap: 0 0 0

Open Indexes:
green open logstash-2016.02.05 2 1 1021019 0 307.9mb 153.9mb
green open logstash-2016.03.02 2 1 4145097 0 1.2gb 618.1mb
green open logstash-2016.02.27 2 1 5146301 0 1.2gb 644.4mb
green open logstash-2016.02.06 2 1 7438119 0 1.5gb 784mb
green open logstash-2016.02.29 2 1 2920126 0 1004.3mb 501.7mb
green open logstash-2016.03.03 2 1 5601321 0 1.1gb 563.8mb
green open logstash-2016.02.07 2 1 1874420 0 508.5mb 254.2mb
green open logstash-2016.03.01 2 1 6835395 0 1.7gb 879.2mb
green open logstash-2016.02.26 2 1 3248001 0 1gb 536.1mb
green open logstash-2016.02.28 2 1 1794958 0 638.6mb 318.1mb
green open logstash-2016.02.08 2 1 1031914 0 506.7mb 253.3mb

The total size of data less than heap size. The total heap size is set to 16GB(50% of RAM)
Why is the heap.current is 15.9gb in this case?
Also there is ~2GB Memory is still available on server.
What is causing the so much memory usage?

Any help?

warkolm · March 3, 2016, 7:11am

If you don't have Marvel installed, you should. It'll give you graphical insight into what is happening and may make it easier to see what is happening.

msimos · March 4, 2016, 12:25am

Unfortunately the problem doesn't seem to be field data related. My guess is there is a query running which is causing the high heap usage. Either a complex aggregation or something returning a lot of results. I would enable the slow log and see if you can determine what queries are running:

https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-slowlog.html

Otherwise I would check if you have changed any of the threadpool settings:

https://www.elastic.co/guide/en/elasticsearch/reference/2.2/modules-threadpool.html#modules-threadpool

If you increased either size or queue_size for bulk, then it may lead to this kind of situation.

I also noticed you said you had 2 nodes but the _cat API only shows one node. So one or two nodes for what you're doing may not be sufficient.

Astro-es · March 23, 2016, 6:17am

@msimos
Thanks for the slow logs links. I have enabled slow logs but found that these logs are generated once in a week. Otherwise, no slow logs or slow indexing_slowlog generated (that means no slow calls are found for these days) but the GC is triggered which makes cluster unavailable.

Moreover, I haven't changed the threadpool settings.

These issue becomes more severe as we have only ~7GB of open indices on cluster with 32GB memory. The garbage collection starts and cluster becomes unavailable, so the Kibana.

Any thoughts would be helpful.

Astro-es · March 26, 2016, 6:34pm

any help on gc's?

warkolm · March 27, 2016, 6:35am

Did you install Marvel so you can see some stats on what is happening?

Topic		Replies	Views
Elasticsearch Heap size growing with time and lot of GC, eventually pulling the cluster down Elasticsearch	6	2579	July 5, 2017
ElasticSearch gc performance on cluster Elasticsearch	3	654	July 5, 2017
Long GC on Elasticsearch master nodes Elasticsearch	4	1659	March 16, 2017
ES heap components Elasticsearch	4	1408	February 28, 2017
Es instance unusable Elasticsearch	7	738	October 12, 2020

Long running GC, cluster status RED, only few GB's data

Related topics