We are currently running ES 6.5 on 71 nodes using a hot-warm architecture. We have 3 master nodes, 34 warm nodes and 34 hot nodes. The hot nodes have 64GB of RAM, 30GB of which is dedicated to the heap. The warm nodes have 128GB of RAM, also with 30GB dedicated to the heap.
We've been suffering from sudden crashes on the hot nodes, and these crashes don't only happen when the ingestion rate is at its peak. Since the cluster is fine with a higher ingestion rate, I don't believe we are hitting any limit yet. I took a heap dump from the hot nodes when they crashed and I see that 80% of the heap is being used by byte arrays, which means that 80% of the heap (24GB!) is byte arrays of documents we want to index.
I've also analyzed the tasks (GET _tasks?nodes=mynode&detailed) being executed on a hot node right before it crashes, and I saw that there were more than 1300 bulk indexing tasks active on the node at that time. Those 1300 bulk indexing tasks amount to about 20GB of data! Some of those tasks had been running for more than 40 seconds. A healthy node shows about 100 bulk tasks being executed.
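For anyone who wants to reproduce the count, here is a minimal sketch of how the tally could be done from the JSON returned by the tasks call above. It assumes the response has already been parsed into a dict with the usual `nodes` -> `tasks` layout, and that bulk tasks can be recognised by an `action` starting with `indices:data/write/bulk`. The function name, the sample data and the 40 second threshold are only illustrative. As a sanity check on the numbers, 20GB spread over 1300 tasks is roughly 15MB per bulk request on average.

```python
# Assumed prefix for bulk actions (covers the bulk[s], bulk[s][p], bulk[s][r] children).
BULK_PREFIX = "indices:data/write/bulk"


def summarize_bulk_tasks(tasks_response, slow_after_seconds=40.0):
    """Count bulk tasks per node and how many exceeded the running-time threshold.

    tasks_response: dict parsed from the JSON body of GET _tasks?detailed
    (hypothetical helper, not part of any Elasticsearch client).
    """
    summary = {}
    for node_id, node in tasks_response.get("nodes", {}).items():
        total = 0
        slow = 0
        for task in node.get("tasks", {}).values():
            # Skip anything that is not a bulk indexing task.
            if not task.get("action", "").startswith(BULK_PREFIX):
                continue
            total += 1
            running_seconds = task.get("running_time_in_nanos", 0) / 1e9
            if running_seconds > slow_after_seconds:
                slow += 1
        # Fall back to the node id when the node name is missing.
        summary[node.get("name", node_id)] = {"bulk_tasks": total, "slow": slow}
    return summary


if __name__ == "__main__":
    # Made-up sample shaped like a tasks response, for illustration only.
    sample = {
        "nodes": {
            "abc123": {
                "name": "mynode",
                "tasks": {
                    "abc123:1": {"action": "indices:data/write/bulk",
                                 "running_time_in_nanos": 45_000_000_000},
                    "abc123:2": {"action": "indices:data/write/bulk[s]",
                                 "running_time_in_nanos": 2_000_000_000},
                    "abc123:3": {"action": "cluster:monitor/tasks/lists",
                                 "running_time_in_nanos": 1_000_000},
                },
            }
        }
    }
    print(summarize_bulk_tasks(sample))
```

Running this against a snapshot from a healthy node and one from a node about to crash makes the 100 versus 1300 difference easy to track over time.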
Why does ES allow 1300 bulk indexing tasks on a node if the bulk indexing queue size is only 10? Shouldn't it be rejecting bulk requests if it's already executing 1300? Is there a way to limit the number of tasks being executed on a node at a time and reject requests once we cross a certain limit?
I wanted to mention that there are no queries running on the hot nodes at all. I also wanted to mention that the cluster has been fine with higher ingestion rates. Only sometimes do some of the nodes seem to get stuck with many bulk indexing requests: they go into full GCs all the time, and that makes the node crash with an OutOfMemory error, followed by the rest of the nodes. When one or two of the hot nodes start to suffer from full GCs, the rest of the hot nodes are still totally fine. The document ID is generated by ES, so there shouldn't be any hotspotting as far as I know, and if there were, it should be happening all the time.
Honestly, I'm running out of ideas and I don't know what else I could check to find the root cause, so any help would be great!
Thanks in advance!