ES 6.1.2 Cluster shows performance bottleneck

Hello,

I don't know if you have already checked for comparison, but we keep an archive of release benchmarks on the usual page (https://elasticsearch-benchmarks.elastic.co), run on our bare-metal environment. Looking in particular at the 99th percentile service_time for the geonames track between 2.4.6 and 6.4.0 on the 1-node configuration, our own benchmarks show:

scroll service_time is lower on 6.4.0: 666.008ms vs 751.116ms on 2.4.6.
country_agg_cached is basically the same (3.796ms vs 3.783ms).
country_agg_uncached is somewhat slower in 6.4 (and 5.6): 222.651ms compared to 190.085ms on 2.4.6, which is roughly a 17% increase in service_time, nowhere near the 115% increase you are observing.
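For reference, here is the arithmetic behind that last figure (just the numbers quoted above, shown as a quick shell one-liner):

```sh
# Relative increase of country_agg_uncached 99th percentile service_time, 2.4.6 -> 6.4.0
awk 'BEGIN { printf "increase: %.1f%%\n", (222.651 / 190.085 - 1) * 100 }'
# prints: increase: 17.1%
```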

The first observation is that since latency is >> service_time in your 6.1.2 Elasticsearch (and this is not observed in the 2.4 setup), requests are spending time queued up before they get serviced, i.e. the cluster is bottlenecked somewhere (see also here).

My first thought would be to check whether the environment setup is precisely the same (in terms of h/w) between your 2.4 and 6.1 clusters; e.g. are you using exactly the same instance types for the ES nodes, in the same region and availability zone as well?
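As a quick sanity check, something along these lines on each node (assuming EC2 and that the instance metadata service is reachable) will confirm the instance type and placement:

```sh
# Query the EC2 instance metadata service for instance type and availability zone
curl -s http://169.254.169.254/latest/meta-data/instance-type; echo
curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone; echo
```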

In addition to that, is the operating system the same (incl. version) for both environments? Apart from differences arising from different kernels and settings, the i3.2xlarge instance you are using for the data node benefits from NVMe instance storage; however, this cannot be utilized efficiently on older Linux kernels.
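A quick way to compare the two environments (assuming standard Linux tooling is available on the nodes) would be something like:

```sh
# Kernel and distribution version
uname -r
cat /etc/os-release

# Check whether the instance store shows up as an NVMe device
lsblk -d -o NAME,MODEL,SIZE,ROTA
ls /dev/nvme* 2>/dev/null
```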

You mentioned you checked the system metrics; have you looked in particular at IO metrics (iostat -xz 1)? I am linking here a useful performance checklist written by Brendan Gregg for checking resource utilization.
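For example (exact column names vary a bit between sysstat versions), sustained high %util, a growing queue size, or large await values on the data node's disk while the benchmark is running would point to an IO bottleneck:

```sh
# Extended per-device IO statistics, refreshed every second
iostat -xz 1
# Columns worth watching: r/s, w/s, await (or r_await/w_await), avgqu-sz (or aqu-sz), %util
```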

Dimitris