So, we have a 26-node cluster (3 masters, 3 clients behind an HAProxy, and 20 "data" nodes). The HAProxy and the 3 clients are the only ones with HTTP turned on; the masters and data nodes have it disabled. The data nodes are beefy bare-metal boxes with the heap set to 32 GB and lots of disk. Currently we have approx. 2.7 TB of data with 4.5 billion docs (not including replicas), spread across 52+ indexes. Running Java 1.7, ES 1.6.1 on CentOS 6.6.
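For context, this is roughly what the node roles look like in `elasticsearch.yml` terms (a sketch using the standard ES 1.x settings; the exact files on our boxes may differ):

```yaml
# Sketch of the 20 data nodes (HTTP disabled, data-only role)
node.master: false
node.data: true
http.enabled: false

# Sketch of the 3 client (coordinating-only) nodes behind HAProxy
# node.master: false
# node.data: false
# http.enabled: true
```

The 32 GB heap is set via the `ES_HEAP_SIZE` environment variable rather than in `elasticsearch.yml`.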
During bulk ingest, in increments of 1,000 docs, by various users/applications, we are seeing nodes drop off (with the heap blowing up on those nodes), which then triggers shard reallocation, which causes more fun... and then finally all 20 data nodes are down. The masters appear to be up and fine, as do the clients, while the data nodes around them crash.
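To make the ingest pattern concrete: each client builds newline-delimited bulk bodies of up to 1,000 actions and POSTs them to `/_bulk` through the HAProxy. A minimal sketch of that chunking (hypothetical helper, not our actual ingest code; index/type names are placeholders):

```python
import json


def bulk_chunks(docs, index, doc_type, chunk_size=1000):
    """Split docs into newline-delimited bulk-API bodies.

    Each yielded string holds at most chunk_size index actions
    (one action line plus one source line per doc) and is ready
    to POST to /_bulk on an ES 1.x cluster.
    """
    for start in range(0, len(docs), chunk_size):
        lines = []
        for doc in docs[start:start + chunk_size]:
            # Action metadata line, then the document source line.
            lines.append(json.dumps({"index": {"_index": index,
                                               "_type": doc_type}}))
            lines.append(json.dumps(doc))
        # Bulk bodies must end with a trailing newline.
        yield "\n".join(lines) + "\n"
```

Dropping `chunk_size` (or the number of concurrent writers) is the usual first knob to turn when bulk requests pressure the heap, though that alone may not explain nodes dying outright.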
The errors we are seeing in the logs are not very specific: mostly "cannot connect"-type errors referring to nodes that have already stopped.
My first basic question is: do those numbers seem crazy? Too high? Normal? (My guess is they should be fine.)