Circuit Breakers not triggering before OOM on a node

Will_Godwin · April 11, 2015, 12:23am

I will have to sanitize the log file before posting it, as it contains
confidential information, but here is the basic rundown of what is happening.
Our circuit breakers do not seem to keep us from hitting an
OOM/"stop-the-world" GC event, which makes the node(s) unresponsive and very
quickly brings down our cluster. This has happened once a day for the last
week. The little background I can give you without posting the log file is
that a large query seems to come in, and one node gets an OOM while the other
nodes trip their circuit breakers. It would be great if the OOM node came
back up without bringing down our cluster, but that is not the case.

We have 3 master nodes, 26 data only nodes and 1 client node in production.

Can someone who has experimented with the circuit breakers give me some
feedback on why we are still getting OOMs tied to a specific API
request even when we set all 3 circuit breakers to 1%?
Circuit breakers seem to work only against single queries (not a whole
API request), which does not help much for an enterprise
solution like ours. Is this a correct assumption?
Is there anything I can do on each node to ensure that we avoid OOMs?

a. Change the max heap size?
b. Change to G1GC?
c. Change the setting index.cache.field.type to soft to allow for more
aggressive GC?
d. Change the JVM options CMSInitiatingOccupancyFraction
and UseCMSInitiatingOccupancyOnly?

Thanks,
Will
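
For readers following along, here is a minimal sketch of what "setting all 3 circuit breakers to 1%" could look like as a request body for the cluster update settings API. The three setting names are the ones documented for Elasticsearch 1.4; the helper name breaker_settings_body and the choice of a transient (rather than persistent) update are assumptions for illustration, not something stated in the post.

```python
import json

# Breaker setting names as documented for Elasticsearch 1.4.
BREAKER_SETTINGS = (
    "indices.breaker.fielddata.limit",
    "indices.breaker.request.limit",
    "indices.breaker.total.limit",
)


def breaker_settings_body(limit: str) -> dict:
    """Build a cluster-settings body that applies one limit to all three breakers.

    Hypothetical helper: it only builds the JSON body and sends nothing.
    """
    return {"transient": {name: limit for name in BREAKER_SETTINGS}}


# The 1% value is the one mentioned in the post above.
body = breaker_settings_body("1%")
print(json.dumps(body, indent=2))
```

The printed JSON would be sent with PUT to /_cluster/settings. The sketch deliberately stops short of making the HTTP call so it can be run without a cluster.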

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/8289fbd0-5a1f-4a15-b718-4dd5fbff1f3a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

warkolm · April 11, 2015, 1:03am

Answering these sorts of questions doesn't really make sense when you've
given us no context.
Can you tell us what your current settings and versions are?

Kris_Davey_2 · April 12, 2015, 7:08pm

We are currently running 1.4.2.

Each box runs 2 instances of ES, except for the 3 master
nodes and the single client node.

Each data node box has 40 cores and 128GB of RAM, of which 28GB is allocated
to the heap. Here are the settings in our yml file:

node.master: false
node.data: true
node.max_local_storage_nodes: 2
bootstrap.mlockall: true
transport.tcp.port: 9300
http.port: 9200
http.max_content_length: 400mb
gateway.recover_after_nodes: 25
gateway.recover_after_time: 1m
gateway.expected_nodes: 29
cluster.routing.allocation.node_concurrent_recoveries: 20
indices.recovery.max_bytes_per_sec: 200mb
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.timeout: 3s
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ['HM01:9300', 'HM02:9300', 'HM03:9300']
index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.fetch.warn: 1s
index.search.slowlog.threshold.fetch.info: 800ms
index.search.slowlog.threshold.fetch.debug: 500ms
index.indexing.slowlog.threshold.index.warn: 10s
index.indexing.slowlog.threshold.index.info: 5s
index.indexing.slowlog.threshold.index.debug: 2s
monitor.jvm.gc.young.warn: 1000ms
monitor.jvm.gc.young.info: 700ms
monitor.jvm.gc.young.debug: 400ms
monitor.jvm.gc.old.warn: 10s
monitor.jvm.gc.old.info: 5s
monitor.jvm.gc.old.debug: 2s
action.auto_create_index: .marvel-*
action.disable_delete_all_indices: true
indices.cache.filter.size: 15%
index.refresh_interval: -1
threadpool.search.type: fixed
threadpool.search.size: 48
threadpool.search.queue_size: 10000000
cluster.routing.allocation.cluster_concurrent_rebalance: 6
indices.store.throttle.type: none
index.reclaim_deletes_weight: 4.0
index.merge.policy.max_merge_at_once: 5
index.merge.policy.segments_per_tier: 5
marvel.agent.exporter.es.hosts: ['172.16.110.238:9200', '172.16.110.237:9200']
marvel.agent.enabled: true
marvel.agent.interval: 30s
script.disable_dynamic: false

Please let me know if there are other settings you would like to see.
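
Because the breaker limits are expressed as percentages of the JVM heap, it can help to see what a given percentage means in bytes for the heap size quoted above. This is only a rough sketch: it assumes the 28GB figure is the heap of a single ES instance and treats GB as GiB, and the function name limit_bytes is made up for the example.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def limit_bytes(heap_bytes: int, percent: float) -> int:
    """Convert a percentage-of-heap limit into an absolute byte count."""
    return int(heap_bytes * percent / 100)


# Assumption: 28GB is the heap of one ES instance, as quoted above.
heap = 28 * GIB

# The 1% limit mentioned earlier in the thread.
one_percent = limit_bytes(heap, 1)
one_percent_mib = one_percent / MIB
print(f"1% of a 28 GiB heap is about {one_percent_mib:.0f} MiB")
```

On those assumptions, a 1% limit leaves each breaker well under 300 MiB to work with per instance.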

lisak · June 24, 2015, 3:54pm

If 35% of the heap is already allocated to Lucene instances in standby mode, then a circuit breaker total.limit of 70% can still lead to an OOME, so you would need to decrease it to 60%. Is that correct?
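
The arithmetic behind that question can be sketched as follows. It takes the premise of the question at face value, namely that the 35% is heap the breakers do not account for; the thread does not confirm that premise, and the function name max_total_limit and the headroom parameter are hypothetical.

```python
def max_total_limit(untracked_percent: float, headroom_percent: float = 0.0) -> float:
    """Largest total.limit (in % of heap) that still fits next to untracked usage.

    untracked_percent: heap assumed to be in use but invisible to the breakers.
    headroom_percent: extra safety margin to leave free.
    """
    return 100.0 - untracked_percent - headroom_percent


# 35% untracked plus a 70% breaker limit adds up to more than the whole heap.
overcommitted = 35 + 70 > 100
print(overcommitted)

# With 35% untracked, the ceiling is 65%; a 5% margin gives the 60% from the question.
print(max_total_limit(35))
print(max_total_limit(35, headroom_percent=5))
```

Under that premise, 60% is simply the 65% ceiling with a 5% safety margin; whether the premise holds depends on what the breakers actually track in a given version.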