What kind of optimizations is Elastic using on their benchmarking hosts? I have a bare-metal server with 20 CPU cores, 64 GB of RAM, and a mounted NVMe drive, but when I run Rally locally with
esrally --distribution-version=6.4.0 --track=nyc_taxis --car="4gheap"
I get at most about 58k docs/s, while Elastic reports around 80k docs/s for their add-4g test (single node).
I have gone through the important system configurations, but there is still a difference of about 20k docs/s. Based on what is said in the benchmarking methodology and environment, I doubt that a hardware difference alone accounts for such a large gap.
Is there something that I am missing?
Thanks.
What type of disk do you have?
I believe we are running Rally on a separate host and are using 10G networking, so if you are running Rally on the Elasticsearch host, that may explain the difference. What do CPU usage and disk I/O look like during indexing?
Hi,
What Christian is saying is correct. We do run the load test driver on a dedicated machine. Please check https://elasticsearch-benchmarks.elastic.co/ for the detailed hardware and software configuration. We intentionally run with the stock configuration as much as possible, so we don't do any kernel tuning, for example (apart from the changes that are required to run Elasticsearch, which you've mentioned in your original post). There is only one exception: we turn on transparent huge pages. The reason is that in earlier kernel versions (IIRC before 4.12.2) this was set to always and was later changed to madvise; we have only "tuned" this so that the historic results stay comparable.
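To see which transparent huge pages mode a host is actually using, one can read the kernel's sysfs file, where the active mode is the bracketed word. A minimal Python sketch, assuming a Linux host that exposes /sys/kernel/mm/transparent_hugepage/enabled (the helper names are illustrative and not part of Rally):

```python
from pathlib import Path

# Standard sysfs location on Linux; may be absent on other systems.
THP_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text):
    """Return the active mode from a line like 'always [madvise] never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active mode found in {text!r}")


def current_thp_mode(path=THP_PATH):
    """Read the sysfs file and return the active THP mode."""
    return parse_thp_mode(path.read_text())


if __name__ == "__main__":
    if THP_PATH.exists():
        print("THP mode:", current_thp_mode())
    else:
        print("THP sysfs file not found on this host")
```

If the benchmark host and your host report different modes, that is one configuration difference worth ruling out before comparing numbers.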
Before every benchmark we run a setup routine to make results more reproducible. We always set up a fresh file system on the disk, issue a TRIM, and drop the page cache. See "Is your Elasticsearch TRIMmed" for more background info.
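The three steps could look roughly like the following. This is only a sketch of the idea and not the actual setup routine: the file system type (ext4), the device, and the mount point are assumptions, and the script only prints the commands because creating a file system wipes the device.

```python
import shlex


def setup_commands(device, mount_point, fs_type="ext4"):
    """Build (but do not run) the commands for a clean benchmark disk.

    fs_type is an assumption; the thread does not say which one is used.
    """
    return [
        ["umount", mount_point],
        [f"mkfs.{fs_type}", device],            # fresh file system (destructive!)
        ["mount", device, mount_point],
        ["fstrim", "-v", mount_point],          # issue a TRIM for the mounted file system
        ["sync"],                               # flush dirty pages first
        ["sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"],  # drop the page cache
    ]


if __name__ == "__main__":
    # Placeholder device and mount point; adjust before running anything as root.
    for cmd in setup_commands("/dev/nvme0n1p1", "/mnt/bench"):
        print(shlex.join(cmd))
```

Printing first and running by hand keeps the destructive step under your control.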
I'd start by putting the load test driver on a dedicated machine. As a next step, I'd look for bottlenecks ("Seven tips for better Elasticsearch benchmarks" has some pointers).
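With the load driver on its own machine, Elasticsearch is started separately on the target host and Rally only generates load against it. A sketch of the invocation, assuming Rally's benchmark-only pipeline and --target-hosts option; the host address is a placeholder, and newer Rally releases expect a race subcommand after esrally:

```python
import shlex


def remote_rally_command(track, target_hosts):
    """Build an esrally command line that benchmarks an already running cluster."""
    return [
        "esrally",
        f"--track={track}",
        "--pipeline=benchmark-only",               # do not provision Elasticsearch
        f"--target-hosts={','.join(target_hosts)}",
    ]


if __name__ == "__main__":
    # 10.0.0.5:9200 is a placeholder for the Elasticsearch host.
    print(shlex.join(remote_rally_command("nyc_taxis", ["10.0.0.5:9200"])))
```

In this mode Rally does not provision the node, so the heap size and Elasticsearch version come from however you started Elasticsearch on the target host.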
Daniel
@Christian_Dahlqvist that could be it. I noticed that during indexing, some python3 processes would sometimes show spikes in I/O while reading the dataset, and a few esrally processes were using a small but not negligible amount of CPU.
I'll try to get a 10G connection between two servers so I can test that idea. When I tried a remote session, the results were not much different, but the load test driver was on a 1G connection.
Thanks.
I used iotop, and this is a snippet of what it showed during indexing:
Total DISK READ : 17.91 M/s | Total DISK WRITE : 139.19 M/s
Actual DISK READ: 17.91 M/s | Actual DISK WRITE: 40.24 M/s
  TID PRIO USER    DISK READ   DISK WRITE  SWAPIN     IO>  COMMAND
16919 be/4 elksvr   0.00 B/s     2.24 M/s  0.00 %  1.67 %  java -Xms4g -Xmx4g -XX:+Use~arch.bootstrap.Elasticsearch
16918 be/4 elksvr   0.00 B/s     3.45 M/s  0.00 %  0.20 %  java -Xms4g -Xmx4g -XX:+Use~arch.bootstrap.Elasticsearch
16904 be/4 elksvr   0.00 B/s     5.27 M/s  0.00 %  0.17 %  java -Xms4g -Xmx4g -XX:+Use~arch.bootstrap.Elasticsearch
16907 be/4 elksvr   0.00 B/s     3.55 M/s  0.00 %  0.14 %  java -Xms4g -Xmx4g -XX:+Use~arch.bootstrap.Elasticsearch
16909 be/4 elksvr   0.00 B/s     3.42 M/s  0.00 %  0.10 %  java -Xms4g -Xmx4g -XX:+Use~arch.bootstrap.Elasticsearch
16923 be/4 elksvr   0.00 B/s     3.86 M/s  0.00 %  0.10 %  java -Xms4g -Xmx4g -XX:+Use~arch.bootstrap.Elasticsearch
16911 be/4 elksvr   0.00 B/s     4.48 M/s  0.00 %  0.06 %  java -Xms4g -Xmx4g -XX:+Use~arch.bootstrap.Elasticsearch
  894 be/3 root     0.00 B/s  1037.65 K/s  0.00 %  0.04 %  [jbd2/nvme0n1p1-]
16912 be/4 elksvr   0.00 B/s     2.93 M/s  0.00 %  0.03 %  java -Xms4g -Xmx4g -XX:+Use~arch.bootstrap.Elasticsearch
16924 be/4 elksvr   0.00 B/s     2.94 M/s  0.00 %  0.02 %  java -Xms4g -Xmx4g -XX:+Use~arch.bootstrap.Elasticsearch
16916 be/4 elksvr   0.00 B/s  2028.84 K/s  0.00 %  0.01 %  java -Xms4g -Xmx4g -XX:+Use~arch.bootstrap.Elasticsearch
16946 be/4 elksvr   0.00 B/s    75.21 M/s  0.00 %  0.01 %  java -Xms4g -Xmx4g -XX:+Use~arch.bootstrap.Elasticsearch
16957 be/4 elksvr   0.00 B/s    24.09 M/s  0.00 %  0.01 %  java -Xms4g -Xmx4g -XX:+Use~arch.bootstrap.Elasticsearch
16888 be/4 elksvr   4.60 M/s     0.00 B/s  0.00 %  0.00 %  python3 /usr/local/bin/esra~track=nyc_taxis --car=4gheap
16890 be/4 elksvr   4.36 M/s     0.00 B/s  0.00 %  0.00 %  python3 /usr/local/bin/esra~track=nyc_taxis --car=4gheap
16892 be/4 elksvr   4.48 M/s     0.00 B/s  0.00 %  0.00 %  python3 /usr/local/bin/esra~track=nyc_taxis --car=4gheap
16893 be/4 elksvr   4.48 M/s     0.00 B/s  0.00 %  0.00 %  python3 /usr/local/bin/esra~track=nyc_taxis --car=4gheap
16906 be/4 elksvr   0.00 B/s   367.82 K/s  0.00 %  0.00 %  java -Xms4g -Xmx4g -XX:+Use~arch.bootstrap.Elasticsearch
16908 be/4 elksvr   0.00 B/s   379.44 K/s  0.00 %  0.00 %  java -Xms4g -Xmx4g -XX:+Use~arch.bootstrap.Elasticsearch
16913 be/4 elksvr   0.00 B/s  1653.27 K/s  0.00 %  0.00 %  java -Xms4g -Xmx4g -XX:+Use~arch.bootstrap.Elasticsearch
16917 be/4 elksvr   0.00 B/s   491.72 K/s  0.00 %  0.00 %  java -Xms4g -Xmx4g -XX:+Use~arch.bootstrap.Elasticsearch
16922 be/4 elksvr   0.00 B/s  1970.76 K/s  0.00 %  0.00 %  java -Xms4g -Xmx4g -XX:+Use~arch.bootstrap.Elasticsearch
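Adding up the read rates of the python3 (Rally) threads in this snapshot shows where the disk reads come from. A small sketch with the values copied by hand from the iotop output above:

```python
# DISK READ rates (M/s) of the four python3 (esrally) threads in the snapshot.
rally_reads = [4.60, 4.36, 4.48, 4.48]
# "Total DISK READ" from the iotop header, in M/s.
total_disk_read = 17.91

rally_read_total = sum(rally_reads)
share = rally_read_total / total_disk_read

print(f"Rally reads: {rally_read_total:.2f} M/s, "
      f"{share:.0%} of total disk reads")
```

The four Rally threads add up to roughly the whole reported read rate, so in this snapshot reading the dataset accounts for essentially all disk reads on the Elasticsearch host.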
And for CPU usage, from top:
top - 10:44:28 up 5 days, 2:32, 4 users, load average: 10.07, 8.65, 4.83
Tasks: 271 total, 1 running, 269 sleeping, 0 stopped, 1 zombie
%Cpu(s): 49.0 us, 1.5 sy, 0.0 ni, 49.4 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 65940932 total, 7513392 free, 5570352 used, 52857188 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 59753956 avail Mem
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
16666 elksvr    20   0 16.953g 5.491g 967124 S 925.9  8.7  91:51.84 java
16887 elksvr    20   0  185568  62184   5512 S  12.0  0.1   1:08.50 esrally
16882 elksvr    20   0  185576  62016   5512 S  11.6  0.1   1:10.67 esrally
16881 elksvr    20   0  185824  71728   5496 S  10.6  0.1   1:08.63 esrally
16883 elksvr    20   0  185836  62372   5512 S  10.6  0.1   1:10.57 esrally
16880 elksvr    20   0  185308  72400   5508 S  10.3  0.1   1:06.01 esrally
16885 elksvr    20   0  185816  71788   5512 S   9.6  0.1   1:08.91 esrally
16884 elksvr    20   0  185304  61740   5512 S   9.0  0.1   1:10.60 esrally
16886 elksvr    20   0  185564  62276   5512 S   7.6  0.1   1:08.21 esrally
16184 root      20   0   57120  16044   7516 S   5.0  0.0   0:31.51 iotop
16291 elksvr    20   0   40632   3888   3164 R   0.7  0.0   0:01.76 top
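Reading the top snapshot as core counts makes the contention question more concrete. A small sketch, with the %CPU values copied from the output above and assuming top's default mode, in which 100% equals one fully busy core:

```python
CORES = 20  # core count from the original post

# %CPU column from the top snapshot (100 = one fully busy core).
java_cpu = 925.9
esrally_cpu = [12.0, 11.6, 10.6, 10.6, 10.3, 9.6, 9.0, 7.6]


def cores_used(percent):
    """Convert a top %CPU value into a number of busy cores."""
    return percent / 100.0


java_cores = cores_used(java_cpu)
rally_cores = cores_used(sum(esrally_cpu))
machine_share = (java_cores + rally_cores) / CORES

print(f"Elasticsearch: {java_cores:.2f} cores")
print(f"Rally:         {rally_cores:.2f} cores")
print(f"Together:      {machine_share:.0%} of {CORES} cores")
```

By this reading Elasticsearch keeps a bit over 9 of the 20 cores busy and the Rally processes together use a little under one core, which is consistent with the roughly 50% overall utilization in the summary line.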
Hi,
For single-node benchmarks a 1 Gbit connection is usually fine. You should check, though, whether the network is saturated. The important thing is to avoid resource contention between Elasticsearch and Rally. Also, loopback behaves a bit differently from Ethernet (e.g. different MTU, different code paths in the kernel).
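One way to check for saturation is to sample the interface byte counters twice during indexing and compare the rate with the link speed. A minimal sketch, assuming a Linux host with /proc/net/dev; the function names and the example numbers are illustrative only:

```python
def read_rx_tx_bytes(interface, proc_net_dev="/proc/net/dev"):
    """Return (received_bytes, transmitted_bytes) for one interface."""
    with open(proc_net_dev) as f:
        for line in f:
            name, sep, rest = line.partition(":")
            if sep and name.strip() == interface:
                fields = rest.split()
                # Field 0 is received bytes, field 8 is transmitted bytes.
                return int(fields[0]), int(fields[8])
    raise KeyError(f"interface {interface!r} not found")


def link_utilization(bytes_before, bytes_after, interval_s, link_bits_per_s):
    """Fraction of the link capacity used between two counter samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    bits = (bytes_after - bytes_before) * 8
    return bits / interval_s / link_bits_per_s


if __name__ == "__main__":
    # Made-up example: 100 MB transferred in one second on a 1 Gbit/s link.
    print(f"{link_utilization(0, 100_000_000, 1.0, 1_000_000_000):.0%}")
```

A value that stays close to 1.0 for the duration of the indexing phase would suggest the link is the bottleneck; a value well below that points elsewhere.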
Daniel