Currently, we’re using Rally to test how much data we can ingest into a 2-node setup (1 data node, 1 dedicated master node). Our ES setup is running in Docker. With Rally we get a maximum ingestion rate of ~150k docs/s using the http_logs track, running in benchmark-only mode like so:
esrally --track=http_logs --pipeline=benchmark-only --target-hosts=10.2.10.200:9200
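One variation worth trying is raising the number of concurrent bulk clients. If I remember correctly, http_logs exposes a `bulk_indexing_clients` track parameter that can be set via `--track-params` (the value 16 below is just an example, not a recommendation):

```shell
# Same benchmark, but with more concurrent bulk indexing clients.
# bulk_indexing_clients is assumed to be a track parameter of http_logs;
# 16 is an arbitrary example value.
esrally --track=http_logs --pipeline=benchmark-only \
        --target-hosts=10.2.10.200:9200 \
        --track-params="bulk_indexing_clients:16"
```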
We’ve placed Rally on a different host than the Elasticsearch nodes. The main issue is that the ES server is mostly idle even during the test (~45% CPU usage). Incoming traffic is ~25MB/s, outgoing traffic ~15-20MB/s. We have 1Gbps connections on these machines, so even combined we’re at about half of the network capacity. Disk write is ~110MB/s; testing with fio we can easily double that throughput, even with random writes (these are normal spinning disks). Disk % busy is ~6% during the test, and disk % I/O wait time is ~2%. The conclusion is that the server is heavily underused. Initially, we thought that perhaps Rally could not generate enough traffic to stress the server, so we tried a different Rally setup (1 coordinator + 2 load driver hosts, all on separate machines), but the throughput is roughly the same. We cannot pinpoint the bottleneck in this scenario, and it looks like ES could do more.
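For reference, the fio random-write test was something along these lines (the directory, block size, file size, and job count shown here are placeholders, not necessarily the exact values we used):

```shell
# Hypothetical fio invocation approximating the random-write test above;
# directory, bs, size, and numjobs are example values only.
fio --name=randwrite-test \
    --directory=/var/lib/elasticsearch \
    --rw=randwrite --bs=4k --size=4g \
    --numjobs=4 --ioengine=libaio --direct=1 \
    --group_reporting
```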
The ES data node in question has 128GB of RAM and 40 CPUs.
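As a back-of-envelope sanity check on the figures above (using only the numbers quoted in this post, and ignoring protocol overhead):

```python
# Rough arithmetic on the observed numbers: implied document size on the wire
# and how much of the 1 Gbps link the test actually uses.
docs_per_sec = 150_000        # observed ingestion rate
incoming_mb_s = 25.0          # observed inbound traffic (MB/s)
outgoing_mb_s = 20.0          # observed outbound traffic, upper bound (MB/s)
link_mb_s = 1_000 / 8         # 1 Gbps ~= 125 MB/s of raw payload

bytes_per_doc = incoming_mb_s * 1e6 / docs_per_sec
network_util = (incoming_mb_s + outgoing_mb_s) / link_mb_s

print(f"~{bytes_per_doc:.0f} bytes/doc on the wire")   # ~167 bytes/doc
print(f"network utilization: {network_util:.0%}")      # 36%
```

So each document is only ~167 bytes on the wire, and even the combined in+out traffic sits well below the raw link capacity, which matches the conclusion that the network is not the limiting factor here.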
Is there a maximum throughput that Rally is capable of generating from a single client?
On https://elasticsearch-benchmarks.elastic.co/index.html#tracks/http-logs/nightly/30d it looks like ~160k docs/s is roughly the number for a single Rally client.