ES filling up the 'old' GC pool

Hi all

We're running a three-node Elasticsearch cluster (two data nodes, one
dataless node) and using it to store data from Logstash.

Every week or two, we see messages like the following in the Elasticsearch logs:

[24.8gb]->[24.5gb]/[24.8gb], all_pools {[young]
[865.3mb]->[586mb]/[865.3mb]}{[survivor] [102.5mb]->[0b]/[108.1mb]}{[old]
[23.9gb]->[23.9gb]/[23.9gb]}
[2014-11-17 15:26:15,066][WARN ][monitor.jvm ] [es-prod-2]
[gc][old][1189982][81480] duration [14.9s], collections [1]/[15.7s], total
[14.9s]/[16.1h], memory
[24.5gb]->[24.5gb]/[24.8gb], all_pools {[young]
[586mb]->[592.5mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old]
[23.9gb]->[23.9gb]/[23.9gb]}
[2014-11-17 15:26:30,715][WARN ][monitor.jvm ] [es-prod-2]
[gc][old][1189983][81481] duration [14.6s], collections [1]/[15.6s], total
[14.6s]/[16.1h], memory
[24.5gb]->[24.5gb]/[24.8gb], all_pools {[young]
[592.5mb]->[589.1mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old]
[23.9gb]->[23.9gb]/[23.9gb]}
[2014-11-17 15:26:46,705][WARN ][monitor.jvm ] [es-prod-2]
[gc][old][1189984][81482] duration [15.2s], collections [1]/[15.9s], total
[15.2s]/[16.1h], memory
[24.5gb]->[24.3gb]/[24.8gb], all_pools {[young]
[589.1mb]->[445.2mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old]
[23.9gb]->[23.9gb]/[23.9gb]}
[2014-11-17 15:27:03,630][WARN ][monitor.jvm ] [es-prod-2]
[gc][old][1189986][81483] duration [15.8s], collections [1]/[15.9s], total
[15.8s]/[16.1h], memory
[24.8gb]->[24.3gb]/[24.8gb], all_pools {[young]
[865.3mb]->[461.7mb]/[865.3mb]}{[survivor] [91.8mb]->[0b]/[108.1mb]}{[old]
[23.9gb]->[23.9gb]/[23.9gb]}

When this occurs, search performance becomes very slow. Even a simple $ curl http://es-prod-2:9200 can take around ten seconds.

The daily indexes created by Logstash vary between 5M and 80M documents,
and between 1.5GiB and 25GiB on disk. The data nodes have ES_HEAP_SIZE=25G
(we saw OOM errors with 15G, and I believe going much above 30GiB is not
recommended because the JVM loses compressed object pointers).
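
In case it matters, here's roughly where the heap is configured and how to
confirm what each JVM actually got - the defaults-file path below is just the
usual deb/rpm location, not necessarily ours:

# e.g. /etc/default/elasticsearch (deb) or /etc/sysconfig/elasticsearch (rpm)
ES_HEAP_SIZE=25g

# check the heap each node actually started with:
curl "http://es-prod-2:9200/_nodes/jvm?pretty"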

I suspect this occurs when users try to query over a large number of
indexes in Kibana.
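
To test that suspicion, something like the following should show per-node heap
and fielddata usage (these are the standard node stats endpoints in 1.x; I'm
not certain of the exact JSON field names):

# per-node JVM heap and indices stats:
curl "http://es-prod-2:9200/_nodes/stats/jvm,indices?pretty"

# fielddata usage broken down by field:
curl "http://es-prod-2:9200/_nodes/stats/indices/fielddata?fields=*&pretty"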

My questions are:

1: How should I tune our cluster to handle these queries? Is our dataset
simply too big?

2: When this happens, I restart the bad node by:

curl -XPUT "http://$HOST:$PORT/_cluster/settings?pretty" -d '{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}'

(restart the Elasticsearch process on the bad node)

curl -XPUT "http://$HOST:$PORT/_cluster/settings?pretty" -d '{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}'

It's then an hour or two before the cluster is green again, as the shards
are assigned and then initialized. Is this the best way to restart a bad
node?
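
(While waiting for green I can at least watch progress with the standard
health and recovery endpoints - nothing custom here:)

curl "http://$HOST:$PORT/_cluster/health?pretty"
curl "http://$HOST:$PORT/_cat/recovery?v"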

3: Can I remove the ability for users to make such intensive requests from
Kibana (either a Kibana setting or an ES setting)?
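
For example - I'm guessing at the relevant 1.x setting names here, so please
correct me if this is the wrong lever - would capping fielddata in
elasticsearch.yml along these lines help?

# hypothetical values; setting names as I understand them for ES 1.x
indices.fielddata.cache.size: 40%
indices.fielddata.breaker.limit: 60%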

Thanks
Wilfred


We're running elasticsearch 1.2.4 on Java 1.7.0_40, for what it's worth.


FWIW, we saw many long-running GC events using the default GC manager -
changing to G1 solved most of the problems (at the expense of slightly
higher CPU all the time). After that you can take the longer road to
debugging memory allocation for your use case :)
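
Roughly speaking - the exact file depends on your install; in the stock layout
the JVM flags are assembled in bin/elasticsearch.in.sh - it's a matter of
dropping the CMS flags and adding G1, something like:

# the stock CMS options look something like this - remove them:
#   -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
#   -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
# and add:
JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC"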

Interesting, I'd also been considering moving to G1. I understand that it
works well with large (>4GiB) heap sizes compared with the default. I'll
give it a try.

I'm not sure how to imitate production load on a test cluster -- it's a
diverse range of data, very bursty, with a high average throughput. I'll
increase GC logging on the production ES cluster to try to make the long GC
tuning road more pleasant :)
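
(For the logging I'm thinking of the usual HotSpot flags, something like this -
the gc.log path is just a guess for our layout:)

# hypothetical log path - adjust to wherever your ES logs live
JAVA_OPTS="$JAVA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/elasticsearch/gc.log"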
