I'm currently upgrading our Elasticsearch cluster to a multi-petabyte cluster (~2.5 petabytes), and I have some questions about scaling it out.
First, I've found that cheap NUCs with 12+ cores, 64GB of RAM, a 1Gbps network interface, and a 2TB NVMe drive seem to make really nice, inexpensive servers. The thing with mobile processors (like the ones used in NUCs) is that they can generally hold their high boost speeds for only 30-60 seconds at a time, but Elasticsearch will rarely floor any core (unless it's doing heavy ingest or merging).
What I'm wondering is exactly how much data is sent between nodes during heavy aggregations. Are we talking about 100MB+? I'm curious if we should be thinking about upgrading the backbone of the cluster to 10GbE, since 100MB would take close to a second (about 0.8 seconds at line rate) over 1Gbps but under 0.1 seconds over 10Gbps.
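To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes ideal line rate with no protocol overhead, latency, or contention, and `transfer_seconds` is just a name made up for illustration, so real-world numbers would be somewhat worse:

```python
def transfer_seconds(payload_mb: float, link_gbps: float) -> float:
    """Ideal time to move payload_mb megabytes over a link_gbps link.

    Assumes decimal units (1 MB = 8 megabits, 1 Gbps = 1000 megabits/s)
    and ignores protocol overhead, latency, and contention.
    """
    megabits = payload_mb * 8
    return megabits / (link_gbps * 1000)


if __name__ == "__main__":
    # Compare a few hypothetical payload sizes on 1 Gbps and 10 Gbps links.
    for payload_mb in (10, 100, 1000):
        for link_gbps in (1, 10):
            secs = transfer_seconds(payload_mb, link_gbps)
            print(f"{payload_mb:>5} MB over {link_gbps:>2} Gbps: {secs:.3f} s")
```

The payload sizes in the loop are placeholders; whether aggregations actually move anything close to 100MB between nodes is exactly what I don't know.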
Also, would Gen 4 PCIe NVMe drives give a noticeable performance increase? How reliant is Elasticsearch on random I/O? My assumption is very, since some searches will request pages all over the place on the NVMe. The newer Sabrent Rocket 4+ NVMe drives push up to 650,000 4K IOPS (at queue depth 32).
My thinking was that a perfect Elasticsearch data node for today would look something like:
128GB RAM (< 32GB reserved for the JVM heap)
PCIe 4.0 NVMe (500,000+ 4K IOPS)
10GbE network interfaces on a 10GbE switch
Going with those specs, it seems like some of the recommendations that were applicable in the past don't really apply here (like shards needing to be 50GB or less). We've had shards in the multi-hundred GB range (partly by accident, because we didn't use templates correctly). However, with 10GbE networks and 7GB+ per second sequential reads from the newer NVMe drives, a 100GB shard could easily be relocated in under five minutes.
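Here is the rough arithmetic behind that estimate, as a sketch rather than a benchmark. It assumes relocation speed is bounded by the slowest of the network link, the source disk's sequential read rate, and the target disk's sequential write rate. The 5 GB/s write figure is a placeholder I picked for illustration. Elasticsearch also rate-limits recovery traffic through the `indices.recovery.max_bytes_per_sec` setting; the 40 MB/s used below is what I understand the default to be, so treat it as an assumption and check it for your version.

```python
from typing import Optional


def relocation_seconds(
    shard_gb: float,
    link_gbps: float,
    disk_read_gbs: float,
    disk_write_gbs: float,
    throttle_mbs: Optional[float] = None,
) -> float:
    """Lower bound on shard relocation time, limited by the slowest stage.

    All rates are converted to GB/s (decimal). Ignores protocol overhead,
    concurrent recoveries, and any other load on the nodes.
    """
    network_gbs = link_gbps / 8  # gigabits/s -> gigabytes/s
    rates = [network_gbs, disk_read_gbs, disk_write_gbs]
    if throttle_mbs is not None:
        rates.append(throttle_mbs / 1000)  # MB/s -> GB/s
    return shard_gb / min(rates)


if __name__ == "__main__":
    # 100 GB shard, 7 GB/s reads (from the post), 5 GB/s writes (assumed).
    for label, gbps, throttle in (
        ("1GbE, no throttle", 1, None),
        ("10GbE, no throttle", 10, None),
        ("10GbE, 40 MB/s recovery throttle (assumed)", 10, 40),
    ):
        secs = relocation_seconds(100, gbps, 7.0, 5.0, throttle)
        print(f"{label}: {secs / 60:.1f} min")
```

Under these assumptions the network is the bottleneck at both link speeds (about 80 seconds for 100GB on 10GbE versus about 13 minutes on 1GbE), unless a recovery throttle is in effect, in which case the throttle dominates and the faster link makes no difference to relocation time.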
Would we see a noticeable latency decrease going with a 10GbE network if our data set has complex aggregations? I have no idea how much data is actually passed between nodes.
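One way to measure this directly might be to diff the transport counters from the node stats API (`GET _nodes/stats/transport`) before and after running an aggregation. A minimal sketch, assuming an unsecured cluster reachable at `http://localhost:9200` (a placeholder); `fetch_transport_stats` and `transport_delta` are hypothetical helper names, not part of any client library:

```python
import json
import urllib.request


def fetch_transport_stats(base_url: str = "http://localhost:9200") -> dict:
    """Fetch per-node transport stats from the node stats API."""
    with urllib.request.urlopen(f"{base_url}/_nodes/stats/transport") as resp:
        return json.load(resp)


def transport_delta(before: dict, after: dict) -> dict:
    """Bytes received/sent over the transport layer per node between snapshots."""
    deltas = {}
    for node_id, node in after["nodes"].items():
        prev = before["nodes"].get(node_id)
        if prev is None:
            continue  # node joined between snapshots; no baseline to diff
        name = node.get("name", node_id)
        deltas[name] = {
            "rx_bytes": node["transport"]["rx_size_in_bytes"]
            - prev["transport"]["rx_size_in_bytes"],
            "tx_bytes": node["transport"]["tx_size_in_bytes"]
            - prev["transport"]["tx_size_in_bytes"],
        }
    return deltas


# Intended usage (not executed here):
#   before = fetch_transport_stats()
#   ... run the aggregation under test ...
#   after = fetch_transport_stats()
#   print(transport_delta(before, after))
```

These counters cover all inter-node traffic (replication, recoveries, cluster state updates), not just the search, so the numbers would only be meaningful on an otherwise quiet cluster.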
Thanks!



