The perfect Elasticsearch data node

Jason_Baumgartner · August 9, 2021, 4:54pm

I'm currently upgrading our Elasticsearch cluster to a multi-petabyte cluster (~2.5 petabytes). I have some questions about scaling out Elasticsearch, etc.

First, I've found that cheap nucs with 12+ cores and 64GB of RAM, 1Gbps network, 2TB NVMe seem to make really nice cheap servers. The thing with mobile processors (like the types used in NUCs) is that they can generally scale to high speed for up to 30-60 seconds -- but Elasticsearch rarely will floor any core (unless doing heavy ingest / merging).

What I'm wondering is exactly how much data is sent between nodes during heavy aggregations, etc.? Are we talking about 100MB+? I'm curious is we should be thinking about upgrading the backbone of the cluster to 10Gbe since 100MB would take a second over 1Gbps but only .1 seconds over 10Gbps.

Also, would gen 4 PCIE NVME's see a better performance increase? How reliant is Elasticsearch on random I/O? My assumption is very since some searches will request pages all over the place on the NVMe. The newer Sabrent Rocket 4+ NVMe drives push up to 650,000 4k IOPs (32q).

My thinking was a perfect Elasticsearch data node for today would look something like:

128 GB Ram (< 32GB reserved for JVM)
PCIE 4.0 NVMe (500,000+ 4k IOPs)
10Gbpe Network interfaces on 10Gbe switch

Going with those specs, it seems like some of the recommendations that were applicable in the past don't really apply here (like shards needing to be 50GB or less). We've had shards in the multi-hundred GB range (partly by accident because we didn't use templates correctly). However, with 10GBe networks and 7GB+ per second sequential reads from the newer NVEs, a 100GB shard could easily be relocated in under five minutes.

Would we see a noticeable latency decrease going with a 10Gbe network if our data set has complex aggregations, etc.? I have no idea how much data is actually passed between nodes.

Thanks!

rcowart · August 10, 2021, 4:50pm

We have been doing a lot of work on extreme scale architectures for some of our customers, even exceeding 5M documents per second. We should have some more hardware this week to further validate our findings.

Let me start with the easy one... storage. Some older work on this topic can be found in this video, where two SATA SSDs beat an NVMe drive.

We recently revisited the above tests with a more powerful 16-core CPU and larger variety of drives, including 2x SATA SSD (MLC), 2x PCIe3 NVMe SSD (TLC) and 2x PCIe4 NVMe SSD (TLC) all in RAID0. The fastest was... the SATA SSDs! But it really was splitting hairs.

2x SATA SSD (MLC) RAID0

2x PCIe3 NVMe SSD (TLC) RAID0

2x PCIe4 NVMe SSD (TLC) RAID0

The above was a pure ingest test, and obviously there needs to be headroom to query, but the take away is still valid... Spinning disks are TERRIBLE! Period! However once you move to SSDs of any kind, even just SATA or SAS, the bottleneck moves elsewhere.

Whether indexing data or aggregation queries over lots (as in 100s of millions) of documents, the bottleneck becomes CPU, and to some degree memory bandwidth. The 64-core/8-memory channel equipment we ordered should arrive this week and will allow us to finalize our conclusions. However a few trends we have seen so far is that additional memory channels can improve performance, and that beyond 64GB of RAM, additional CPU cores provide the most improvement in query performance.

As far as network goes, take a look at the above charts and you can see how easy it is to saturate 1Gbit links. However if you want to truly minimize bandwidth utilization, coordinating nodes can make a significant difference. We have some other tests that demonstrate this.

Jason_Baumgartner · August 11, 2021, 10:32am

Thank you!!!

PS: What GUI monitoring software is that?

rcowart · August 11, 2021, 11:37am

It is InfluxDB/Chronograf/Telegraf. It is the InfluxDB v1 stack. I haven't yet updated to v2. It isn't good for logs or netflow data, but is much more efficient than Elastic for timeseries metrics. Most of the Chronograf dashboards I use are here:

rcowart · August 28, 2021, 12:01pm

Adding a bit more to this thread...

It turns out that to achieve the best performance it is CRITICAL to populate all memory channels. At least that is the case for AMD processors. Although it is logical that it applies to Intel as well.

Many organizations will order servers with only a few available memory slots populated. The thinking is that they will be able to increase capacity later. Unfortunately this can have a significant negative impact on the performance of Elasticsearch, and likely other applications.

system · September 25, 2021, 12:02pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Elasticsearch data node sizing on Azure infrastructure Elasticsearch	1	512	May 15, 2018
Testbed guideline Elasticsearch	1	259	July 6, 2017
ES Hardware Profiles Elasticsearch	3	470	July 6, 2017
Node Sizing / Best Practices Elasticsearch	2	378	July 6, 2017
Elasticsearch hardware requirement,and benchmarking Elasticsearch	10	2550	July 6, 2017

The perfect Elasticsearch data node

Related topics