We have setup a cluster with 3 masters 5 client and 8 data nodes With ES 1.7.1. We are seeing crashes in client nodes with OOM exception. According to the thread dump at the time of crash there are nearly 35 threads active on the node.
This crash happened in 2 of the 5 client nodes. total number of client threads bulk indexing in parallel = 24(on the entire cluster, they only contact the client nodes.).
35 threads are not particularly much for a server process. From the thread dump I just see that a few HTTP handler threads are allocating buffer memory (presumably for incoming bulk requests). I suggest that you take a closer look at the heap dump instead (for example with Memory Analyzer which should reveal why the client nodes need so much memory. I guess it is because you're sending / buffering too much bulk index traffic for the amount of memory you've allocated. This means you either need to increase the system capacity or reduce the load on the client nodes. So you can try to:
Reduce bulk request size
Reduce the number of clients sending bulk index requests
Reduce incoming queue lengths
Increase heap memory on the client nodes
Add more data nodes (based on the premise that they cannot cope with the amount of traffic sent and they put backpressure on the client nodes)
But these are just pointers based on my assumption above. You need to check the heap dump and run further tests. I hope this gets you started.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.