ES does not scale with Rally track http_logs


I have configured an Elasticsearch 6.2.4 cluster on Kubernetes in GCP.

9 Kubernetes nodes in west3a, b, c
24 CPUs and 45 GB RAM per node

3 master nodes, each with 2 CPUs and 4 GB RAM
2 coordinating nodes, each with 2 CPUs and 4 GB RAM
3 data nodes, each with 4 CPUs and 16 GB RAM and standard disks

From another pod in the same Kubernetes cluster I run a single esrally instance (0.11.0 on Ubuntu 18.04):
esrally --track=http_logs --target-hosts=es-coord.database.svc.cluster.local:443 --pipeline=benchmark-only --client-options="use_ssl:true,verify_certs:false,basic_auth_user:'user',basic_auth_password:'pass'"

With this setup I get 148587 documents/sec for indexing and 8 queries/sec with 196 ms latency for the default query. This is not bad, but no matter what I do, I don't see Elasticsearch scaling linearly.

6 data nodes: 180000 documents/sec
9 data nodes: 196000 documents/sec
3 data nodes with ssd: 152000 documents/sec
6 data nodes with ssd: 202000 documents/sec
9 data nodes with ssd: 203000 documents/sec
3 data nodes only in west3a: 145000 documents/sec
6 data nodes only in west3a: 187000 documents/sec
9 data nodes only in west3a: 182000 documents/sec
9 data nodes only in west3a on small K8s nodes: 179000 documents/sec
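
To put a number on "not linear", the sketch below computes speedup and scaling efficiency for the standard-disk, three-zone runs, using the 3-data-node result as the baseline. The function name `scaling_efficiency` is made up for this illustration; only the throughput figures come from the measurements above.

```python
# Throughput figures (documents/sec) copied from the post:
# standard disks, data nodes spread over all three zones.
measured = {3: 148587, 6: 180000, 9: 196000}


def scaling_efficiency(results, baseline_nodes=3):
    """Return {nodes: (speedup, efficiency)} relative to the baseline node count.

    Efficiency is the measured speedup divided by the ideal (linear) speedup,
    so 1.0 would mean perfectly linear scaling.
    """
    base = results[baseline_nodes]
    out = {}
    for nodes, docs_per_sec in sorted(results.items()):
        speedup = docs_per_sec / base
        ideal = nodes / baseline_nodes
        out[nodes] = (speedup, speedup / ideal)
    return out


if __name__ == "__main__":
    for nodes, (speedup, eff) in scaling_efficiency(measured).items():
        print(f"{nodes} data nodes: speedup {speedup:.2f}x, efficiency {eff:.0%}")
```

Doubling the data nodes yields roughly a 1.2x speedup and tripling them roughly 1.3x, which is well below linear and suggests the limit sits somewhere other than the data nodes.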

I have also tried using the coordinating nodes' IPs for target-hosts instead of the K8s service, and I have tried using more coordinating nodes. Setting --track-params="clients:64" and running more esrally daemons made absolutely no difference either. We use the X-Pack metrics and Metricbeat for the containers, but I don't see a bottleneck. None of the involved processes is particularly busy. The only suspicious thing is that I don't see more than 150 Mbit/sec network throughput, but this should not affect the query results.
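
As a rough sanity check on the network figure, the sketch below converts the observed throughput into bytes on the wire per indexed document. It assumes (this is not stated above) that the 150 Mbit/sec ceiling was observed during the 3-data-node indexing run at 148587 documents/sec, and the helper name `implied_bytes_per_doc` is made up for this illustration.

```python
def implied_bytes_per_doc(network_mbit_per_sec, docs_per_sec):
    """Bytes transferred per document if all traffic were indexing traffic."""
    bytes_per_sec = network_mbit_per_sec * 1_000_000 / 8  # Mbit/s -> bytes/s
    return bytes_per_sec / docs_per_sec


if __name__ == "__main__":
    # ASSUMPTION: both numbers belong to the same run; the post does not say so.
    print(f"{implied_bytes_per_doc(150, 148587):.0f} bytes per document")
```

Comparing this value with the average size of a document in the track's corpus gives a hint whether the network figure simply reflects the indexing rate or whether something is capping it.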

Do you have any ideas where the bottleneck could be?


That is really hard to tell. You might want to check out "Seven Tips for Better Elasticsearch Benchmarks", specifically tip 5, where I mention the USE method by Brendan Gregg for analyzing bottlenecks.

I could imagine that you'd also need to experiment with different parameters like bulk size, shard count, index buffer size etc.
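
One way to organize such experiments is to generate one esrally invocation per parameter combination, as in the sketch below. The track parameter names `bulk_size` and `number_of_shards` and their values are assumptions for illustration; check the http_logs track documentation for the names your Rally version supports. The client options from the original command are left out for brevity, and the index buffer size is not covered because it is configured on the Elasticsearch nodes rather than passed to Rally.

```python
from itertools import product

# Options copied from the command in the question (client options omitted).
BASE_CMD = [
    "esrally",
    "--track=http_logs",
    "--target-hosts=es-coord.database.svc.cluster.local:443",
    "--pipeline=benchmark-only",
]

# ASSUMPTION: parameter names and values are illustrative only; verify them
# against the http_logs track before running anything.
SWEEP = {
    "bulk_size": [1000, 5000, 10000],
    "number_of_shards": [3, 6, 9],
}


def build_commands(base_cmd, sweep):
    """Return one command (as a list of arguments) per parameter combination."""
    names = sorted(sweep)
    commands = []
    for values in product(*(sweep[name] for name in names)):
        params = ",".join(f"{name}:{value}" for name, value in zip(names, values))
        commands.append(base_cmd + [f"--track-params={params}"])
    return commands


if __name__ == "__main__":
    # Only prints the commands; nothing is executed here.
    for cmd in build_commands(BASE_CMD, SWEEP):
        print(" ".join(cmd))
```

Changing one parameter at a time and recording the throughput for each run makes it easier to see which setting, if any, moves the result.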

If none of the processes is busy, that could point to lock contention, which is unfortunately not easy to spot without a profiler.

I am not sure I understand you correctly, but other traffic on your network will affect your query latency. Also, concurrent indexing does have an effect on query performance (see also the webinar "Using Rally to Get Your Elasticsearch Cluster Size Right", which is free to watch but requires prior registration).


