Newbie question, ES "sizing"?

So, we have a 26-node cluster (3 masters, 3 clients behind an HAProxy, and 20 "data" nodes). The HAProxy and the 3 clients are the only ones with HTTP turned on; the masters and data nodes have it off. The data nodes are beefy bare-metal boxes, with the heap set to 32 GB and lots of disk. Currently, we have approx. 2.7 TB of data with 4.5 billion docs (not including replicas), spread across 52+ indexes. We are running Java 1.7 and ES 1.6.1 on CentOS 6.6.

During bulk ingest, in batches of 1000 docs, by various users/applications, we are seeing data nodes drop off (with heap blowing up on those nodes), which then triggers shard reallocation, which causes more fun, and finally all 20 data nodes are down. The masters and the clients appear to stay up and fine while the data nodes around them crash.

The errors we are seeing in the logs are not very specific, other than "cannot connect" type errors to nodes that have stopped.

My first basic question is: do those numbers seem crazy? Too high? Normal? (My guess is they should be fine.)

You may want to reduce your heap to 30GB; that's the recommended limit.

How many shards do you have?
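
In case it helps answer that, here is a rough sketch of how the shard count could be pulled from the `_cat/shards` API through one of the HTTP-enabled client nodes. The base URL is a placeholder, the function names are made up for illustration, and the parsing assumes the five columns requested with `h=` (index, shard, prirep, state, node), with the node column empty for unassigned shards.

```python
from collections import Counter
from urllib.request import urlopen


def fetch_shards(base_url="http://localhost:9200"):
    """Fetch raw `_cat/shards` text. base_url is a placeholder, not a real host."""
    url = base_url + "/_cat/shards?h=index,shard,prirep,state,node"
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")


def summarize_shards(cat_output):
    """Return (totals by primary/replica, shard count per node)."""
    totals = Counter()
    per_node = Counter()
    for line in cat_output.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        prirep = fields[2]
        # Node names may contain spaces; unassigned shards have no node column.
        node = " ".join(fields[4:]) or "UNASSIGNED"
        totals["primary" if prirep == "p" else "replica"] += 1
        per_node[node] += 1
    return totals, per_node
```

Calling `summarize_shards(fetch_shards())` gives the totals and a per-node breakdown, which makes it easy to spot a data node carrying far more shards than the others.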

Isn't the condition that the heap needs to be strictly less than 32 GB for pointers to be compressed, i.e. 31 GB or even 31.9 GB should be fine?

30.5GB is the word from Oracle.
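
For anyone wondering where the 32 GB figure comes from, the arithmetic can be sketched as follows. It assumes HotSpot's default 8-byte object alignment and 32-bit compressed references; this gives the theoretical ceiling only, and the cutoff a given JVM actually applies can be somewhat lower, which is why figures like 30 to 31 GB get quoted.

```python
# Compressed oops store a 32-bit value that is scaled by the object alignment,
# so the addressable heap is 2^32 slots of `alignment` bytes each.
REFERENCE_BITS = 32
OBJECT_ALIGNMENT_BYTES = 8  # assumed HotSpot default (-XX:ObjectAlignmentInBytes)
GIB = 2 ** 30


def max_compressed_heap_bytes(reference_bits=REFERENCE_BITS,
                              alignment=OBJECT_ALIGNMENT_BYTES):
    """Theoretical upper bound on a heap addressable with compressed oops."""
    return (2 ** reference_bits) * alignment


print(max_compressed_heap_bytes() / GIB)  # 32.0
```

So a heap of exactly 32 GB sits on the boundary rather than safely under it.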

I used the link:

as the guide for the 32GB setting. I can try 30.9GB, 31GB, etc. and see what happens.
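
Rather than guessing, one way to test a candidate heap size is to ask the JVM itself whether compressed oops are still enabled at that `-Xmx`. The sketch below shells out to `java -XX:+PrintFlagsFinal -version` and reads the `UseCompressedOops` line. The function names are hypothetical, and it assumes `java` on the PATH is the same JVM that runs Elasticsearch.

```python
import re
import subprocess


def compressed_oops_enabled(flags_output):
    """Parse -XX:+PrintFlagsFinal output; return True/False, or None if absent."""
    match = re.search(r"\bUseCompressedOops\s*:?=\s*(true|false)", flags_output)
    if match is None:
        return None
    return match.group(1) == "true"


def check_heap(heap="31g", java="java"):
    """Run the JVM with the given max heap and report the compressed oops flag."""
    result = subprocess.run(
        [java, "-Xmx" + heap, "-XX:+PrintFlagsFinal", "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return compressed_oops_enabled(result.stdout)
```

Running `check_heap("31g")` and `check_heap("32g")` on one of the data nodes should show at which size the flag flips to false.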