I recently built a cluster with 29 hosts. 3 of the hosts have dual 20 core CPUs with 512Gb of memory. On each of those 3 hosts I set up a master instance with 32Gb/mem and 2 CPUs and 4 ingest instances with 32Gb/mem and 4 CPUs. The remaining 26 hosts have dual 14-core CPUs with 512Gb of memory and 6 1.6Tb SSDs. Each of those hosts have 3 data instances with 7 32Gb/mem and 8 CPUs with two of the SSDs dedicated to each instanace. Using a stress test tool written in Python I get about 500K docs/sec (10Tb/hour) throughput. I'm fairly happy with those results, but I have 78 spare hosts (for a short while) with the same spces as the 26 I'm using for compute nodes so I wanted to see how ES scaled. I set up the remaining 78 nodes like the data nodes. The problem is that the cluster went to crap. I'm now only getting about 30K docs/sec when scaling - I'm seeing a lot of nodes drop out of the cluster and then rejoin. Eventually one of the master nodes will fail and the cluster basically freezes.
I'm also running the same stress test between the 78 data node cluster and the 312 data node cluster, but he 312 data node cluster fails miserably. I'm not sure where to look at this point.