Client node crash with OOM exception

Hello,

We have set up a cluster with 3 master nodes, 5 client nodes, and 8 data nodes running ES 1.7.1. We are seeing the client nodes crash with OOM exceptions. According to the thread dump taken at the time of the crash, nearly 35 threads were active on the node.

Here is the thread dump:

The crash happened on 2 of the 5 client nodes. In total, 24 client threads are bulk indexing in parallel across the entire cluster, and they contact only the client nodes.

Hi,

35 threads is not particularly many for a server process. From the thread dump I can only see that a few HTTP handler threads are allocating buffer memory (presumably for incoming bulk requests). I suggest that you take a closer look at the heap dump instead (for example with Memory Analyzer), which should reveal why the client nodes need so much memory. My guess is that you are sending or buffering more bulk index traffic than the allocated heap can hold. This means you either need to increase the system capacity or reduce the load on the client nodes. So you can try to:

  • Reduce bulk request size
  • Reduce the number of clients sending bulk index requests
  • Reduce incoming queue lengths
  • Increase heap memory on the client nodes
  • Add more data nodes (on the premise that the existing data nodes cannot keep up with the traffic and therefore put backpressure on the client nodes)
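
As an illustration of the first two points, here is a minimal Python sketch that caps each bulk request by byte size and document count before it is sent, and estimates how many request bytes could be in flight at once. It only builds the newline-delimited bulk body and does not talk to the cluster. The function names, the index name "logs", the type name "event", and the 5 MiB / 1000-document limits are hypothetical values chosen for the example, not recommendations taken from this thread.

```python
import json
from typing import Dict, Iterable, Iterator


def chunk_bulk_actions(
    docs: Iterable[Dict],
    index: str,
    doc_type: str,
    max_bytes: int = 5 * 1024 * 1024,  # assumed cap per bulk request
    max_docs: int = 1000,              # assumed cap per bulk request
) -> Iterator[str]:
    """Yield bulk request bodies that stay within max_bytes and max_docs.

    A single document larger than max_bytes is still emitted, alone,
    in its own request.
    """
    lines = []
    size = 0
    count = 0
    for doc in docs:
        # One action line plus one source line per document.
        action = json.dumps({"index": {"_index": index, "_type": doc_type}})
        entry = action + "\n" + json.dumps(doc) + "\n"
        entry_size = len(entry.encode("utf-8"))
        # Flush the current batch before it would exceed either limit.
        if lines and (size + entry_size > max_bytes or count >= max_docs):
            yield "".join(lines)
            lines, size, count = [], 0, 0
        lines.append(entry)
        size += entry_size
        count += 1
    if lines:
        yield "".join(lines)


def worst_case_in_flight_bytes(writers: int, max_bytes: int) -> int:
    """Raw request bytes if every writer has one full request pending
    on the same client node."""
    return writers * max_bytes
```

With the 24 parallel writers mentioned in the question and a 5 MiB cap, the raw request bodies alone could add up to 120 MiB on a single client node in the worst case. The actual heap usage is likely higher, since the node also has to parse the requests and hold them while waiting for the data nodes, so this figure is only a lower bound to compare against the configured heap.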

But these are just pointers based on my assumption above. You need to check the heap dump and run further tests. I hope this gets you started.

Daniel