I will have to sanitize the log file because it contains some confidential information, but here is the basic rundown of what is happening.
Our situation is that the circuit breakers do not seem to prevent OOM errors and "stop-the-world" GC events, which cause the affected node(s) to become unresponsive and very quickly bring down our cluster. We have seen this happen once a day for the last week. The little background I can give without posting the log file is that a large query comes in, one node hits an OOM, and the other nodes trigger their circuit breakers. Ideally the OOM node would come back up without bringing down the cluster, but that is not the case.
We have 3 master nodes, 26 data-only nodes, and 1 client node in production.
Can someone who has experimented with the circuit breakers give me some feedback on why we are still getting OOMs from a specific API request even after setting all 3 circuit breakers to 1%?
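For reference, the limits are being set roughly like this (whether via the cluster settings API as shown here or in elasticsearch.yml; the 1% values come from our experiment and are obviously not a recommendation, and the setting names assume the fielddata, request, and parent breakers of ES 1.4+):

    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "persistent": {
        "indices.breaker.fielddata.limit": "1%",
        "indices.breaker.request.limit": "1%",
        "indices.breaker.total.limit": "1%"
      }
    }'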
The circuit breakers seem to only work against individual queries (not against a single API request as a whole), which does not help much for an enterprise solution like ours. Is this a correct assumption?
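One way to sanity-check that assumption, if I am not mistaken about the endpoint, is the per-node breaker statistics, which should report tripped counts, limits, and estimated sizes for each breaker:

    curl -XGET 'http://localhost:9200/_nodes/stats/breaker?pretty'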
Is there anything I can do on each node to help us avoid OOMs? For example (see the sketch after this list):
a. Change the max heap size?
b. Change to G1GC?
c. Change the setting index.cache.field.type to soft to allow for more aggressive GC?
d. Change the JVM options CMSInitiatingOccupancyFraction and UseCMSInitiatingOccupancyOnly?
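Concretely, (a), (b), and (d) would look roughly like the following; the path, the 30g heap, and the occupancy fraction of 65 are illustrative assumptions only, not settings we have verified:

    # /etc/default/elasticsearch (path varies by install)
    ES_HEAP_SIZE=30g   # (a) keep below ~32 GB so compressed oops stay enabled

    # (d) start CMS earlier, before the old generation is nearly full
    ES_JAVA_OPTS="$ES_JAVA_OPTS -XX:CMSInitiatingOccupancyFraction=65 -XX:+UseCMSInitiatingOccupancyOnly"

    # (b) alternatively switch to G1 (and drop the CMS flags above)
    # ES_JAVA_OPTS="$ES_JAVA_OPTS -XX:+UseG1GC"

For (c), index.cache.field.type: soft would go in elasticsearch.yml, though as I understand it that only makes the field data cache use soft references so entries can be collected under memory pressure, rather than bounding its size.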
If 35% of the heap is already allocated for Lucene instances in standby mode, then a circuit breaker total.limit of 70% will lead to an OOME, so you would need to decrease it to something like 60%? Is that correct?
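Putting rough numbers on that reading (a 30 GB heap is assumed purely for illustration):

    35% held by Lucene:      0.35 x 30 GB = 10.5 GB
    parent breaker at 70%:   0.70 x 30 GB = 21.0 GB tracked before tripping
    10.5 GB + 21.0 GB = 31.5 GB > 30 GB  -> OOME can occur before the breaker trips
    parent breaker at 60%:   10.5 GB + 18.0 GB = 28.5 GB <= 30 GB

So with indices.breaker.total.limit lowered to 60%, the breaker should trip before the heap is exhausted, assuming the 35% figure is accurate and the rest of the memory is actually tracked by the breakers.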