I will have to sanitize the log file before posting, since it contains some
confidential information, but here is the basic rundown of what is happening.
Our situation is that the circuit breakers do not seem to keep us from
hitting an OOM / "stop-the-world" GC event, which makes the affected node(s)
unresponsive and very quickly brings down our cluster. We have seen this
happen once a day for the last week. The little background I can give
without posting the log file: a large query comes in, one node hits an OOM,
and the other nodes trip their circuit breakers. Ideally the OOM node would
come back up without bringing down the cluster, but that is not the case.
We have 3 master nodes, 26 data-only nodes, and 1 client node in production.
- Can someone who has experimented with the circuit breakers give me
feedback on why we are still getting OOMs from a specific API request even
with all 3 circuit breakers set to 1%?
- The circuit breakers seem to apply per query rather than per API request,
which does not help much for an enterprise workload like ours. Is that a
correct assumption?
- Is there anything I can do on each node to ensure that we avoid OOMs?
  a. Increase the max heap size?
  b. Switch to G1GC?
  c. Set index.cache.field.type to soft, so the field cache can be evicted
     under memory pressure?
  d. Tune JVM options such as CMSInitiatingOccupancyFraction?
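For context, the breaker settings we changed look roughly like this in
elasticsearch.yml (setting names assume Elasticsearch 1.4+, where the
parent, fielddata, and request breakers are all configurable; the 1% values
are the ones from our experiment, not a recommendation):

```yaml
# elasticsearch.yml -- illustrative circuit breaker limits
# (assumption: Elasticsearch 1.4+ setting names)
indices.breaker.total.limit: 1%       # parent breaker, caps the child breakers combined
indices.breaker.fielddata.limit: 1%   # field data loaded for sorting/faceting
indices.breaker.request.limit: 1%     # per-request structures such as aggregation buckets
```

These limits can also be changed at runtime through the cluster settings
update API instead of a rolling restart, which is handy when experimenting.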