ES filling up the 'old' GC pool

Hi all

We're running a three-node Elasticsearch cluster (two data nodes, one
dataless node) and using it to store data from Logstash.

Every week or two, we see messages like the following in the Elasticsearch logs:

[24.8gb]->[24.5gb]/[24.8gb], all_pools {[young]
[865.3mb]->[586mb]/[865.3mb]}{[survivor] [102.5mb]->[0b]/[108.1mb]}{[old]
[23.9gb]->[23.9gb]/[23.9gb]}
[2014-11-17 15:26:15,066][WARN ][monitor.jvm ] [es-prod-2]
[gc][old][1189982][81480] duration [14.9s], collections [1]/[15.7s], total
[14.9s]/[16.1h], memory
[24.5gb]->[24.5gb]/[24.8gb], all_pools {[young]
[586mb]->[592.5mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old]
[23.9gb]->[23.9gb]/[23.9gb]}
[2014-11-17 15:26:30,715][WARN ][monitor.jvm ] [es-prod-2]
[gc][old][1189983][81481] duration [14.6s], collections [1]/[15.6s], total
[14.6s]/[16.1h], memory
[24.5gb]->[24.5gb]/[24.8gb], all_pools {[young]
[592.5mb]->[589.1mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old]
[23.9gb]->[23.9gb]/[23.9gb]}
[2014-11-17 15:26:46,705][WARN ][monitor.jvm ] [es-prod-2]
[gc][old][1189984][81482] duration [15.2s], collections [1]/[15.9s], total
[15.2s]/[16.1h], memory
[24.5gb]->[24.3gb]/[24.8gb], all_pools {[young]
[589.1mb]->[445.2mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old]
[23.9gb]->[23.9gb]/[23.9gb]}
[2014-11-17 15:27:03,630][WARN ][monitor.jvm ] [es-prod-2]
[gc][old][1189986][81483] duration [15.8s], collections [1]/[15.9s], total
[15.8s]/[16.1h], memory
[24.8gb]->[24.3gb]/[24.8gb], all_pools {[young]
[865.3mb]->[461.7mb]/[865.3mb]}{[survivor] [91.8mb]->[0b]/[108.1mb]}{[old]
[23.9gb]->[23.9gb]/[23.9gb]}

When this occurs, search performance becomes very slow. Even a simple $ curl http://es-prod-2:9200 can take around ten seconds.

The daily indexes created by Logstash vary between 5M and 80M documents,
and between 1.5GiB and 25GiB on disk. The data nodes have ES_HEAP_SIZE=25G
(we saw OOM errors with 15G, and I believe going much above 30GiB is not
recommended because the JVM loses compressed object pointers).
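
In case it matters, here's roughly where the heap is configured and how to
confirm what each JVM actually got - the defaults-file path below is just the
usual deb/rpm location, not necessarily ours:

# e.g. /etc/default/elasticsearch (deb) or /etc/sysconfig/elasticsearch (rpm)
ES_HEAP_SIZE=25g

# check the heap each node actually started with:
curl "http://es-prod-2:9200/_nodes/jvm?pretty"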

I suspect this occurs when users try to query over a large number of
indexes in Kibana.
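
To test that suspicion, something like the following should show per-node heap
and fielddata usage (these are the standard node stats endpoints in 1.x; I'm
not certain of the exact JSON field names):

# per-node JVM heap and indices stats:
curl "http://es-prod-2:9200/_nodes/stats/jvm,indices?pretty"

# fielddata usage broken down by field:
curl "http://es-prod-2:9200/_nodes/stats/indices/fielddata?fields=*&pretty"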

My questions are:

1: How should I tune our cluster to handle these queries? Is our dataset
simply too big?

2: When this happens, I restart the bad node by:

curl -XPUT "http://$HOST:$PORT/_cluster/settings?pretty" -d '{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}'

(restart the Elasticsearch process on the bad node)

curl -XPUT "http://$HOST:$PORT/_cluster/settings?pretty" -d '{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}'

It's then an hour or two before the cluster is green again, as the shards
are assigned and then initialized. Is this the best way to restart a bad
node?
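
(While waiting for green I can at least watch progress with the standard
health and recovery endpoints - nothing custom here:)

curl "http://$HOST:$PORT/_cluster/health?pretty"
curl "http://$HOST:$PORT/_cat/recovery?v"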

3: Can I remove the ability for users to make such intensive requests from
Kibana (either a Kibana setting or an ES setting)?
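
For example - I'm guessing at the relevant 1.x setting names here, so please
correct me if this is the wrong lever - would capping fielddata in
elasticsearch.yml along these lines help?

# hypothetical values; setting names as I understand them for ES 1.x
indices.fielddata.cache.size: 40%
indices.fielddata.breaker.limit: 60%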

Thanks
Wilfred


We're running elasticsearch 1.2.4 on Java 1.7.0_40, for what it's worth.


FWIW, we saw many long-running GC events using the default GC manager -
changing to G1 solved most of the problems (at the expense of slightly
higher CPU all the time). After that you can take the longer road to
debugging memory allocation for your use case :)
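
Roughly speaking - the exact file depends on your install; in the stock layout
the JVM flags are assembled in bin/elasticsearch.in.sh - it's a matter of
dropping the CMS flags and adding G1, something like:

# the stock CMS options look something like this - remove them:
#   -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
#   -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
# and add:
JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC"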

Interesting, I'd also been considering moving to G1. I understand that it
works well with large (>4GiB) heap sizes compared with the default. I'll
give it a try.

I'm not sure how to imitate production load on a test cluster -- it's a
diverse range of data, very bursty, with a high average throughput. I'll
increase GC logging on the production ES cluster to try to make the long GC
tuning road more pleasant :)
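
(For the logging I'm thinking of the usual HotSpot flags, something like this -
the gc.log path is just a guess for our layout:)

# hypothetical log path - adjust to wherever your ES logs live
JAVA_OPTS="$JAVA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/elasticsearch/gc.log"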
