Long-running GC, cluster status RED, only a few GB of data

We have been facing regular ES cluster outages for a long time now, generally when garbage collection kicks in, even though the data size is relatively small and enough memory is available.

ES Setup:
ES version: 2.0.0
Cluster size: 2 nodes
Memory: 32 GB

Config:
ES Heap Size: 50% of RAM
index.number_of_shards: 2
index.number_of_replicas: 1
indices.fielddata.cache.size: 20%
bootstrap.mlockall: true
Heap % used is not consistently > 75%.

Data:
Total no. of open indices: 11
Data size: 6 GB

[2016-02-08 16:34:22,033][WARN ][monitor.jvm ] [elastic-node-1] [gc][old][1913][491] duration [35.7s], collections [1]/[35.8s], total [35.7s]/[3.1h], memory [15.9gb]->[15.9gb]/[15.9gb], all_pools {[young] [266.2mb]->[266.2mb]/[266.2mb]}{[survivor] [221.7kb]->[5.2mb]/[33.2mb]}{[old] [15.6gb]->[15.6gb]/[15.6gb]}
[2016-02-08 16:34:46,383][WARN ][monitor.jvm ] [elastic-node-1] [gc][old][1914][492] duration [24.2s], collections [1]/[24.3s], total [24.2s]/[3.1h], memory [15.9gb]->[15.9gb]/[15.9gb], all_pools {[young] [266.2mb]->[242.4mb]/[266.2mb]}{[survivor] [5.2mb]->[0b]/[33.2mb]}{[old] [15.6gb]->[15.6gb]/[15.6gb]}
[2016-02-08 16:35:20,045][WARN ][monitor.jvm ] [elastic-node-1] [gc][old][1915][493] duration [33.4s], collections [1]/[33.6s], total [33.4s]/[3.1h], memory [15.9gb]->[15.9gb]/[15.9gb], all_pools {[young] [242.4mb]->[266.2mb]/[266.2mb]}{[survivor] [0b]->[268.9kb]/[33.2mb]}{[old] [15.6gb]->[15.6gb]/[15.6gb]}
[2016-02-08 16:36:16,011][WARN ][monitor.jvm ] [elastic-node-1] [gc][old][1917][495] duration [33.8s], collections [1]/[33.9s], total [33.8s]/[3.1h], memory [15.9gb]->[15.9gb]/[15.9gb], all_pools {[young] [266.2mb]->[266.2mb]/[266.2mb]}{[survivor] [17.1mb]->[26.5mb]/[33.2mb]}{[old] [15.6gb]->[15.6gb]/[15.6gb]}
[2016-02-08 16:36:36,249][WARN ][monitor.jvm ] [elastic-node-1] [gc][old][1918][496] duration [20.2s], collections [1]/[20.2s], total [20.2s]/[3.1h], memory [15.9gb]->[15.9gb]/[15.9gb], all_pools {[young] [266.2mb]->[266.2mb]/[266.2mb]}{[survivor] [26.5mb]->[29.6mb]/[33.2mb]}{[old] [15.6gb]->[15.6gb]/[15.6gb]}
[2016-02-08 16:37:08,229][WARN ][monitor.jvm]

We regularly close indices (one per day) to save memory, keeping only the last 7 days of data; each index is 500-800 MB in size.
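For reference, the daily close uses the standard close-index API, roughly like this (the index name here is just an example):

curl -XPOST 'http://localhost:9200/logstash-2016.02.01/_close'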

Questions:

  1. Why do these long-running GCs start even though there is enough free memory available, considering the data size is only 6 GB?

  2. Are any additional configurations required to avoid this stop-the-world situation?

Any help will be appreciated.

It's really hard to tell what is causing this without knowing things like what version of Elasticsearch you are using. This blog post talks about it, though it is a little old. In 2.x doc values are the default. You should be able to add doc values to the template that you use to build the next day's index. That will make the index take up more space on disk but less space in heap memory.
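For example, a minimal template sketch along these lines (the template name and the dynamic string mapping are placeholders; adjust to your own fields):

curl -XPUT 'http://localhost:9200/_template/logstash_doc_values' -d '{
  "template": "logstash-*",
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "strings_as_doc_values": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "string",
              "index": "not_analyzed",
              "doc_values": true
            }
          }
        }
      ]
    }
  }
}'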

What sort of document structure is this?
What sort of queries are you doing?

:stuck_out_tongue:

We store logs from our application servers and nginx. We do text-based searching through Kibana.

There are some scheduled aggregation jobs that run every 24 hours. Otherwise, we don't query it very frequently.

@warkolm
Any suggestions on this?

When you start to see this problem, can you run the following command? This is all one line.

curl  'http://localhost:9200/_cat/nodes?v&h=name,heap.max,heap.current,heap.percent,fielddata.memory_size,completion.size,filter_cache.memory_size,percolate.memory_size,segments.memory,segments.index_writer_memory,segments.version_map_memory'

You can also run this periodically to get an idea of your memory usage.
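For example, a crude polling loop like this would do it (the interval and output file are arbitrary):

while true; do
  date
  curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.max,heap.current,heap.percent,fielddata.memory_size,segments.memory'
  sleep 60
done >> node-heap.log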

Thanks @msimos,

I have collected the output of the above command when the cluster went down:
Node stats:
curl 'http://localhost:9200/_cat/nodes?v&h=name,heap.max,heap.current,heap.percent,fielddata.memory_size,completion.size,filter_cache.memory_size,percolate.memory_size,segments.memory,segments.index_writer_memory,segments.version_map_memory'
name heap.max heap.current heap.percent fielddata.memory_size completion.size percolate.memory_size segments.memory segments.index_writer_memory segments.version_map_memory
prod-elastic-node-1 15.9gb 15.9gb 99 187.3mb 0b -1b 34.5mb 0b 0b

The GC traces:

[2016-03-03 06:22:50,372][WARN ][monitor.jvm ] [elastic-node-1] [gc][old][284][2] duration [26.6s], collections [1]/[26.7s], total [26.6s]/[29.6s], memory [15.7gb]->[15gb]/[15.9gb], all_pools {[young] [128.7mb]->[2.2mb]/[266.2mb]}{[survivor] [33.2mb]->[0b]/[33.2mb]}{[old] [15.6gb]->[15gb]/[15.6gb]}
[2016-03-03 06:23:26,084][WARN ][monitor.jvm ] [elastic-node-1] [gc][old][287][3] duration [32.5s], collections [1]/[33.5s], total [32.5s]/[1m], memory [15.4gb]->[15.7gb]/[15.9gb], all_pools {[young] [10.8mb]->[61.2mb]/[266.2mb]}{[survivor] [33.2mb]->[0b]/[33.2mb]}{[old] [15.4gb]->[15.6gb]/[15.6gb]}
[2016-03-03 06:23:59,447][WARN ][monitor.jvm ] [elastic-node-1] [gc][old][288][4] duration [33s], collections [1]/[33.3s], total [33s]/[1.5m], memory [15.7gb]->[15.7gb]/[15.9gb], all_pools {[young] [61.2mb]->[74.3mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [15.6gb]->[15.6gb]/[15.6gb]}

free -m:

                    total       used       free     shared    buffers     cached
Mem:                30679      28537       2142          0        287       8634
-/+ buffers/cache:              19616      11063
Swap:                   0          0          0

Open Indexes:
green open logstash-2016.02.05 2 1 1021019 0 307.9mb 153.9mb
green open logstash-2016.03.02 2 1 4145097 0 1.2gb 618.1mb
green open logstash-2016.02.27 2 1 5146301 0 1.2gb 644.4mb
green open logstash-2016.02.06 2 1 7438119 0 1.5gb 784mb
green open logstash-2016.02.29 2 1 2920126 0 1004.3mb 501.7mb
green open logstash-2016.03.03 2 1 5601321 0 1.1gb 563.8mb
green open logstash-2016.02.07 2 1 1874420 0 508.5mb 254.2mb
green open logstash-2016.03.01 2 1 6835395 0 1.7gb 879.2mb
green open logstash-2016.02.26 2 1 3248001 0 1gb 536.1mb
green open logstash-2016.02.28 2 1 1794958 0 638.6mb 318.1mb
green open logstash-2016.02.08 2 1 1031914 0 506.7mb 253.3mb

The total size of the data is less than the heap size. The total heap size is set to 16 GB (50% of RAM).
Why is heap.current at 15.9gb in this case?
Also, there is still ~2 GB of memory available on the server.
What is causing so much memory usage?

Any help?

If you don't have Marvel installed, you should. It'll give you graphical insight into the cluster and may make it easier to see what is happening.
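Installation on 2.x should be roughly as follows (double-check the Marvel docs for your exact version):

bin/plugin install license
bin/plugin install marvel-agent
# and on the Kibana side:
bin/kibana plugin --install elasticsearch/marvel/latest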

Unfortunately the problem doesn't seem to be field data related. My guess is there is a query running which is causing the high heap usage. Either a complex aggregation or something returning a lot of results. I would enable the slow log and see if you can determine what queries are running:

https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-slowlog.html
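For example, the slow logs can be enabled dynamically per index with something like this (the thresholds below are just a starting point, tune them to your latencies):

curl -XPUT 'http://localhost:9200/logstash-*/_settings' -d '{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.indexing.slowlog.threshold.index.warn": "10s"
}'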

Otherwise I would check if you have changed any of the threadpool settings:

https://www.elastic.co/guide/en/elasticsearch/reference/2.2/modules-threadpool.html#modules-threadpool

If you increased either size or queue_size for bulk, then it may lead to this kind of situation.
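A quick way to check for queueing or rejections is something like:

curl 'http://localhost:9200/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected,search.active,search.queue,search.rejected'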

I also noticed you said you had 2 nodes, but the _cat API only shows one node. So one or two nodes may not be sufficient for what you're doing.

@msimos
Thanks for the slow log link. I have enabled slow logs, but found that these logs are only generated about once a week. Otherwise, no search slow logs or indexing slow logs are generated (meaning no slow calls were found on those days), yet the GC is still triggered and makes the cluster unavailable.

Moreover, I haven't changed the threadpool settings.

This issue feels even more severe given that we have only ~7 GB of open indices on a cluster with 32 GB of memory. Garbage collection starts, the cluster becomes unavailable, and so does Kibana.

Any thoughts would be helpful.

Any help on the GCs?

Did you install Marvel so you can see some stats on what is happening?