Hello,
I don't know if you have already checked for comparison, but we have an archive of release benchmarks on the usual page (https://elasticsearch-benchmarks.elastic.co) from our bare metal environment. Looking at the 99th percentile service_time for the geonames track between 2.4.6 and 6.4.0 on 1 node, our own benchmarks show:

- scroll: service time is lower on 6.4.0 (666.008ms vs 751.116ms on 2.4.6).
- country_agg_cached: basically the same (3.796ms vs 3.783ms).
- country_agg_uncached: a bit slower on 6.4 (and 5.6), at 222.651ms compared to 190.085ms on 2.4.6, but nowhere near the 115% increase you are observing.
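If you have identical hardware available, a quick way to cross-check these numbers in your own environment is to run the geonames track with Rally in benchmark-only mode against each cluster. This is just a sketch; the target host and report file are placeholders you would need to adjust:

```
# Run the geonames track against an already-running cluster (Rally does not provision it).
# --target-hosts and the report file are placeholders for your environment.
esrally --track=geonames --pipeline=benchmark-only \
  --target-hosts=10.0.0.5:9200 \
  --report-format=csv --report-file=/tmp/geonames-2.4.6.csv
```

Comparing the reports from both clusters should give you the same 99th percentile service_time figures as the page above.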
The first observation is that since latency is much greater than service_time in your 6.1.2 Elasticsearch cluster (and this is not observed in the 2.4 setup), the cluster is bottlenecked somewhere (see also here). Latency additionally includes the time a request spends waiting in the load generator's queue before it is sent, so a large gap between latency and service_time means requests are queuing up faster than the cluster can process them.
My first thought would be to check whether the environment setup is exactly the same in terms of hardware between your 2.4 and 6.1 clusters; e.g. are you using exactly the same instance types for the Elasticsearch nodes, and in the same region and availability zone?
In addition to that, is the operating system (including version) the same for both environments? Apart from differences arising from different kernels and settings, the i3.2xlarge instance you are using for the data node benefits from NVMe instance storage; however, this cannot be utilized efficiently on older Linux kernels.
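As a quick sanity check on both data nodes, something like the following should confirm that the instance type, kernel and storage layout really do match (the metadata endpoint is EC2-specific, and the data path below is a placeholder for your actual path.data):

```
# Instance type and availability zone (EC2 instance metadata service).
curl -s http://169.254.169.254/latest/meta-data/instance-type; echo
curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone; echo

# Kernel and distribution versions.
uname -r
cat /etc/os-release

# Verify the Elasticsearch data path sits on the NVMe instance store, not on EBS.
lsblk
df -h /var/lib/elasticsearch    # placeholder; use your actual path.data
```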
You mentioned you checked the system metrics; have you looked at IO metrics in particular (iostat -xz 1)? I am linking here a useful performance checklist written by Brendan Gregg for checking resource utilization.
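For reference, the first-minute commands from that checklist are roughly the ones below (from memory, so treat it as a starting point). Running them on the data node while the benchmark is in progress and comparing both clusters, especially the await and %util columns of iostat for the NVMe device, should quickly show where the extra time is going:

```
uptime              # load averages over 1/5/15 minutes
vmstat 1            # run queue, swapping, overall CPU
mpstat -P ALL 1     # per-CPU utilization (look for a single saturated core)
pidstat 1           # per-process CPU usage
iostat -xz 1        # per-device IO: r/s, w/s, await, %util
free -m             # memory and page cache
sar -n DEV 1        # per-interface network throughput
top                 # overall picture
```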
Dimitris