We were seeing some real slowness in queries on our new Elasticsearch 6.1.2 cluster. To understand more, we started running the Rally geonames track to get some performance benchmarks of the cluster.
The test results show that querying is quite slow.
Here is our test configuration, run on AWS:
|Setting|Value|
|---|---|
|Elasticsearch version|6.1.2|
|Master nodes|3 × r4.large|
|Data nodes|3 × i3.2xlarge|
|Front node (Rally host)|1 × r4.2xlarge|
|Rally track|geonames|
|Java version|1.8.0_161|
Summary / our analysis of the data:

* We ran the same Rally tests on our legacy 2.4 cluster (3 data nodes, same AWS hardware). The 2.4 tests performed far better than the 6.1.2 cluster.
* We also ran the same Rally tests against a single-data-node 6.1.2 cluster. Its numbers were much better than the 3-data-node 6.1.2 cluster, but still worse than the 2.4 cluster.
* For the uncached country_agg numbers there is a huge gap between latency and service time. We checked CPU utilization and all the other system metrics; CPU utilization hovers at only around 40-50%.
* We have run these tests multiple times over the last week and the Rally results are consistent.
* We are running a basic setup, with very little changed in the Elasticsearch configuration of the 6.1.2 cluster.
Any pointers or thoughts on what might be going on in our cluster? We are migrating our users from 2.4 to 6.1.2 and want to get a good handle on this before we roll everyone over to the new cluster and shut down the old one.
I don't know if you have already checked for comparison, but we keep an archive of release benchmarks at the usual page (https://elasticsearch-benchmarks.elastic.co), run on our bare-metal environment. Looking at the 99th percentile service_time for the geonames track between 2.4.6 and 6.4.0 on one node, our own benchmarks show:
* scroll service time is lower on 6.4.0: 666.008ms vs 751.116ms on 2.4.6.
* country_agg_cached is basically the same (3.796ms vs 3.783ms).
* country_agg_uncached service time is a bit slower on 6.4 (and 5.6): 222.651ms compared to 190.085ms on 2.4.6, but nowhere near the 115% increase you are observing.
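For reference, the relative slowdown in our own numbers works out like this (a quick back-of-the-envelope calculation in Python, using only the figures quoted above):

```python
# Relative increase in country_agg_uncached 99th percentile service_time
# between 2.4.6 and 6.4.0 in the published release benchmarks.
old, new = 190.085, 222.651  # milliseconds
increase = (new - old) / old * 100
print(f"country_agg_uncached is {increase:.1f}% slower on 6.4.0")  # ~17.1%
```

That is roughly a 17% regression, far smaller than the ~115% you are seeing.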
The first observation is that since latency >> service_time on your 6.1.2 cluster (and this is not observed in the 2.4 setup), the cluster is bottlenecked somewhere (see also here).
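Roughly speaking, service_time measures the request round trip itself, while latency also includes any time the request spends waiting before it is sent, so a large gap means the cluster cannot keep up with the benchmark's target throughput even though CPU is only at 40-50%. One thing worth checking is whether the search thread pools on the data nodes are saturating while the benchmark runs; here is only a sketch, with the host/port as placeholders for one of your nodes:

```python
# Sample the search thread pool on every node while the Rally run is in progress.
# A persistently non-zero queue (or any rejections) means search threads are saturated.
import json
import time
from urllib.request import urlopen

URL = ("http://localhost:9200/_cat/thread_pool/search"
       "?format=json&h=node_name,active,queue,rejected")

for _ in range(10):                      # ten samples, two seconds apart
    with urlopen(URL) as resp:
        for node in json.load(resp):
            print(node["node_name"],
                  "active:", node["active"],
                  "queue:", node["queue"],
                  "rejected:", node["rejected"])
    time.sleep(2)
```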
My first thought would be to check whether the environment setup is exactly the same (in terms of hardware) between your 2.4 and 6.1.2 clusters; e.g. are you using exactly the same instance types for the ES nodes, in the same region and availability zone?
In addition to that, is the operating system (including version) the same in both environments? Apart from differences arising from different kernels and settings, the i3.2xlarge instances you are using for the data nodes benefit from NVMe instance store; however, this cannot be used efficiently on older Linux kernels.
You mentioned you checked the system metrics; have you looked at the IO metrics in particular (iostat -xz 1)? I am linking here a useful performance checklist written by Brendan Gregg for checking resource utilization.
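If it helps, here is a minimal sketch of the environment details I would compare across the two clusters (kernel version, whether the NVMe instance store devices are present, and a quick IO snapshot). It assumes the data nodes run Linux with sysstat installed; adjust the device naming if yours differs:

```python
# Run on each data node and diff the output between the 2.4 and 6.1.2 environments.
import os
import platform
import subprocess

print("kernel:", platform.release())    # older kernels handle NVMe less efficiently
print("platform:", platform.platform())

# On i3 instances the NVMe instance store shows up as /dev/nvme*n1.
nvme_devices = [d for d in os.listdir("/sys/block") if d.startswith("nvme")]
print("nvme devices:", nvme_devices or "none found")

# Per-device IO utilization, same data as `iostat -xz 1` (requires sysstat).
subprocess.run(["iostat", "-xz", "1", "3"], check=False)
```

High await or %util on the devices backing the Elasticsearch data path would point to an IO bottleneck rather than CPU.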