Currently, we’re trying to use Rally to test how much data we can ingest in a 2-node setup (1 data node, 1 dedicated master node). Our ES setup is running in Docker. With Rally we get a maximum ingestion rate of ~150k docs/s using the http_logs track running in benchmark-only mode.
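A benchmark-only invocation typically looks like the following sketch (the target host name is a placeholder, and exact flags can differ between Rally versions):

```shell
# Sketch of a benchmark-only Rally run against an existing cluster;
# "es-data-node" is a placeholder for the actual data node address.
esrally --track=http_logs --pipeline=benchmark-only \
    --target-hosts=es-data-node:9200
```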
We’ve placed Rally on a different host than the Elasticsearch nodes. The main issue is that the ES server is mostly idle even during the test (~45% CPU usage). Incoming traffic is ~25MB/s, outgoing traffic ~15-20MB/s. We have 1Gbps connections on these machines, so even combined we’re at half of the network capacity. Disk write is ~110MB/s; testing with fio we can easily double the disk throughput, even with random writes (these are normal spinning disks). Disk busy is ~6% during the test, and disk I/O wait time is ~2%. The conclusion is that the server is heavily underused. Initially, we thought that perhaps Rally could not generate enough traffic to stress the server, so we used a different Rally setup (1 coordinator + 2 load driver hosts, all on different hosts), but the throughput is roughly the same. We cannot pinpoint the bottleneck in this scenario, and it looks like ES could do more.
The ES data node in question has 128GB of RAM and 40 CPUs.
Is there a maximum throughput that Rally is capable of generating on a single client?
I cannot say much about the Docker part; maybe you have issues with the storage driver that you're choosing (e.g. avoid unionfs)? I'd try to reduce the number of layers involved and reproduce the issue on bare metal.
Based on what you describe I'd probably check next for lock contention. I usually do this by attaching a profiler but you should be able to see this as well in the number of (involuntary) context switches although, without a baseline to compare with, it might be hard to make sense of the number that you see. In general I suggest you follow the USE method to pinpoint the bottleneck. Given the number of CPUs on your system, I guess it is a NUMA system so you could also check your current NUMA configuration.
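If a profiler is not at hand, the system-wide context-switch rate can be eyeballed directly from /proc/stat; a minimal sketch (Linux only — per-process voluntary/involuntary counts live in /proc/&lt;pid&gt;/status):

```shell
# Sample the system-wide context-switch counter twice, one second apart;
# the delta approximates context switches per second on this host.
c1=$(awk '/^ctxt/ {print $2}' /proc/stat)
sleep 1
c2=$(awk '/^ctxt/ {print $2}' /proc/stat)
echo "context switches/s: $((c2 - c1))"
```

Run this once at idle and once during the benchmark to get the baseline comparison mentioned above.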
That depends on the hardware that you have and also on the track that you're using. I don't have any numbers offhand, but we did test this a while back on our nightly benchmarking hardware by connecting Rally to a remote nginx that returned dummy responses, and the throughput we got was way beyond what can be achieved in a benchmark against a real Elasticsearch node. One important thing though: you need to place the data files for Rally on an SSD, because due to Rally's disk access pattern a spinning disk might indeed cause a client-side bottleneck.
Thanks for the response. We don't see any obvious problem related to context switches in our monitoring; during the benchmark we see an increased number of context switches (~40k/s, see https://cl.ly/d2e023099be5). We're collecting metrics with collectd from the physical host (outside of Docker).
We kind of pinpointed part of the issue: the number of concurrent Rally indexing clients (bulk_indexing_clients). It looks like there is really no way of stressing this server with the default 8 clients. We tested with 32/64/96 indexing clients and it is looking a bit better.
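For reference, the client count can be raised via a track parameter; a sketch, assuming the http_logs track exposes bulk_indexing_clients as described above (the host name is a placeholder):

```shell
# Same benchmark-only run, but with 64 bulk indexing clients instead of
# the default 8. "es-data-node" is a placeholder.
esrally --track=http_logs --pipeline=benchmark-only \
    --target-hosts=es-data-node:9200 \
    --track-params="bulk_indexing_clients:64"
```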
We were able to index 200k docs/s with 2 distributed rally load drivers indexing directly to this server. Still, we were not able to use the full throughput of the disk (~700MB/s). CPU was around 75% during the benchmark. And network usage increased to around 600Mbps (300Mbps in, and 300Mbps out).
What you mentioned about Rally needing to run on an SSD (for better performance) is interesting, because in our case we're running Rally on a server with the same specs (no Docker). But we don't see significant disk usage (not even reads), and the 2 Rally load drivers are mostly idle as well.
This is a bit more than what we see on the default test (the nightly benchmarks for the same track are normally at ~160k docs/s), but we're not sure whether it's possible to generate a more intense load on this server.
In this scenario, do you think that adding another Rally load driver would make a difference? I haven't tried a profiler just yet, because the values that we get from our monitoring don't show any saturation on the cluster.
The data files are probably in your page cache? If the files need to be read from a spinning disk, this could be problematic as each client gets an offset into the file and reads from there concurrently with other clients. You could try dropping the page cache (with echo 3 > /proc/sys/vm/drop_caches as root) and see if something changes.
In your original post you mentioned that you tried adding a second load driver host but it did not really help? I really wonder whether you have a client-side bottleneck. You could start checking this by running against an nginx instance that just returns dummy responses for bulk requests (see e.g. nginx_es_bulk_stub.conf · GitHub). As you'd need to mock all API calls but are only interested in measuring bulk-indexing throughput, I suggest you use task filtering by specifying --include-tasks="index-append" when invoking Rally. That way you can see how much throughput the load driver can achieve with Elasticsearch taken out of the equation.
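A minimal sketch of such a stub, written out as a heredoc (this is a simplification of the linked gist: it answers every request with a canned "no errors" bulk response; the real gist mocks more of the ES API, and the port and file path here are arbitrary):

```shell
# Write a minimal nginx server block that replies to every request with a
# dummy Elasticsearch-style bulk response, taking ES out of the equation.
cat > /tmp/es_bulk_stub.conf <<'EOF'
server {
    listen 19200;
    location / {
        default_type application/json;
        return 200 '{"took":1,"errors":false,"items":[]}';
    }
}
EOF
```

Rally would then be pointed at localhost:19200 in benchmark-only mode with --include-tasks="index-append".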
Exactly, but in this case, if it's reading from the page cache, then putting Rally on an SSD machine will not provide an additional benefit, right?
I followed your advice and ran the stub/dummy _bulk endpoint with nginx. Using a single Rally load driver (1 coordinator, 1 load driver), I can saturate the 1Gbps connection of the server:
| Lap | Metric | Task | Value | Unit |
| --- | --- | --- | --- | --- |
| All | Min Throughput | index-append | 610310 | docs/s |
| All | Median Throughput | index-append | 618889 | docs/s |
| All | Max Throughput | index-append | 619775 | docs/s |
So this would be the maximum throughput that we can push to a single node with 1 Rally load driver and 300 bulk_indexing_clients using the http_logs track. Of course, when indexing to the real ES node I get a bit less than that, at ~289k docs/s.
Thanks for the help. Now we know that, at least network-bound, we could do ~618k docs/s (for the http_logs track).
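As a quick sanity check, those two numbers imply an average wire size per document; a back-of-envelope sketch (assuming a decimal 1 Gbps link and the median throughput from the table above):

```shell
# At ~618889 docs/s on a saturated 1 Gbps link, the average size per
# http_logs document on the wire (including bulk metadata) is roughly:
bits_per_sec=1000000000
docs_per_sec=618889
echo "$((bits_per_sec / 8 / docs_per_sec)) bytes/doc"   # ~201 bytes
```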
A bit of background: we are testing the difference in throughput between RAID0 and RAID10 in our scenario. For RAID10 the current limit (considering disk I/O) is close to ~150MB/s using Rally; a bit higher with documents with a larger payload size (we've seen numbers close to 220MB/s, which corresponds to ~90% disk busy).
Yes, that's correct. If the data set is completely cached then you shouldn't see a difference. I just wanted to point out that a spinning disk might be problematic if you use multiple clients.
Based on what you describe, I still wonder whether lock contention is causing your bottleneck. I still think that the best option to spot this is to attach a profiler to Elasticsearch while you are running the benchmark. It becomes pretty evident when you look at the thread states of the bulk indexing threads.
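Short of a full profiler, a few JVM thread dumps can already show whether the bulk threads are spending their time blocked; a sketch (ES_PID is a placeholder for the Elasticsearch process id, and the thread-name pattern to grep for depends on the ES version):

```shell
# Take three thread dumps a couple of seconds apart and show the state of
# the bulk indexing threads. Many BLOCKED/WAITING states across dumps hint
# at lock contention rather than a saturated resource.
for i in 1 2 3; do
    jstack "$ES_PID" | grep -A 2 'bulk'
    sleep 2
done
```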