I've been running some aggressive tests (deliberately on reduced hardware specs) and ran into the following issue. My cluster is:
- ES 2.2.0 on CentOS 6.7
- 4x nodes with 4GB RAM (2GB for the JVM), 8 cores.
- Extended the bulk thread pool queue size from 50 to 1000
- 20 clients hitting the REST API, sending bulk indexing requests of 1000 records each
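For reference, this is roughly how the queue-size change looks in elasticsearch.yml on 2.x (treat the value of 1000 as my test setting, not a recommendation):

```
# elasticsearch.yml (ES 2.x thread pool settings)
# Default bulk queue_size in 2.x is 50; raised here so bursty bulk
# requests queue up instead of being rejected (EsRejectedExecutionException).
threadpool.bulk.queue_size: 1000
```

One thing worth noting: raising the queue defers backpressure, so more pending bulk requests sit on the heap at once, which may itself contribute to memory pressure.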
Then I ran a few heavy aggregations. The idea is to force ES to read the data from disk (bypassing the FS cache entirely) to measure some sort of "worst-case scenario".
One of the nodes crashed with an OutOfMemoryError during one of the aggregations. **This corrupted 100% of the shards on that node!** The cluster is now relocating shards, but it's painfully slow.
I find it hard to believe that a single query, however heavy, can corrupt an entire node. This would be a no-go for production in the project I'm working on.
Is there a way to ensure that query memory cannot take down a whole node? I've been wondering whether lowering the three available circuit breakers so that they sum to less than 100% would help (https://www.elastic.co/guide/en/elasticsearch/guide/current/_limiting_memory_usage.html).
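Something like this is what I had in mind - the exact percentages are just an example, and the defaults mentioned in the comment are the documented 2.x defaults:

```
# elasticsearch.yml - illustrative values only
# 2.x defaults: fielddata 60%, request 40%, total 70%, so fielddata +
# request can together account for the whole heap before the parent
# breaker trips. Lowered here so their sum stays below 100% of the heap.
indices.breaker.fielddata.limit: 40%
indices.breaker.request.limit: 30%
indices.breaker.total.limit: 70%
```

These can also be changed at runtime through the `_cluster/settings` API instead of the config file.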
But I find this really annoying - lowering these limits "just in case" doesn't seem reasonable.
Any other ideas / tips?
The full trace can be found here: http://pastie.org/pastes/10722735/text?key=8jrwowaumantpoqjvctq