Elasticsearch 1.5 -> 5.5 performance degradation

Hello,

Elasticsearch has been a core part of the infrastructure where I work for a few years. We use the AWS managed version, which unfortunately exposes few configuration options and keeps several APIs closed.

Recently we tried switching one of our services, which has been using ES 1.5 for some years now, to ES 5.5 (also AWS managed). We were expecting at least some improvement out of the box, without having to do any configuration beyond what we did for the ES 1.5 cluster.

However, we found that it cannot handle our production load: response times skyrocketed every time we tried to switch (we tried different scenarios/index configurations). Here is a sample graph showing the response time increase we saw while the new cluster was enabled:

[graph: response time percentiles]

This shows 99th, 95th, 50th (median) percentiles and average response times.

Some details that describe our use case:

  • we make ~400 bulk indexing requests/minute. Each of these contains at most 50 documents (each around 1.6 KB)
  • the cluster gets ~4000 requests/minute, combining filtered search queries (this service does not do any full-text search) and aggregations (we don't do any nested/pipeline aggregations), just plain counts of values
  • the cluster has just one index, with ~30M documents in it. The schema does not have any nested objects
  • 5 nodes, 5 primary shards with 4 replicas each
  • our queries can get up to 4500 hits at once, and we use source filtering to retrieve just the parts of the documents we're interested in (see the sketch after this list)
  • in terms of hardware, we're using m3.2xlarge.elasticsearch instances, with SSD storage (the default one for these).
  • we have an index refresh interval of 30 minutes
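
To give a concrete idea of the shape of our requests, here is a rough sketch using the Python client. The endpoint, index name, fields and filter values are illustrative only, not our real ones, since I can't share the actual mapping/queries:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["https://example-es-endpoint:443"])  # placeholder endpoint

# Representative query: pure filters (no scoring / full-text search), source
# filtering, a plain terms aggregation, and up to 4.5k hits per request.
# Index and field names below are invented for illustration.
resp = es.search(
    index="our_index",
    body={
        "size": 4500,                        # we request up to 4500 hits at once
        "_source": ["field_a", "field_b"],   # source filtering: only the fields we need
        "query": {
            "bool": {
                "filter": [
                    {"term": {"status": "active"}},
                    {"range": {"created_at": {"gte": "now-7d"}}},
                ]
            }
        },
        "aggs": {
            "status_counts": {"terms": {"field": "status"}}  # simple value counts
        },
    },
)
print(resp["took"], len(resp["hits"]["hits"]))
```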

This more or less describes the setup. Something we tried that made a difference (it brought response times down) was disabling the query cache and the request cache (the 1.5 cluster has the query cache disabled and we have not had any issues with it).
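
For reference, this is roughly how those caches can be toggled with the Python client, assuming the managed service exposes the index settings API. The index names are placeholders; note that in 5.x index.requests.cache.enable is dynamic, while index.queries.cache.enabled is static and can only be set when the index is created (or while it is closed):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["https://example-es-endpoint:443"])  # placeholder endpoint

# Sketch only: disable the shard request cache on an existing index
# (dynamic setting, can be changed in place).
es.indices.put_settings(
    index="our_index",
    body={"index.requests.cache.enable": False},
)

# The query cache setting is static, so it would have to go into the
# settings of a newly created index, e.g.:
es.indices.create(
    index="our_index_v2",
    body={"settings": {"index.queries.cache.enabled": False}},
)
```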

I am posting this here in case someone has hints on how to get to the source of this issue, or in case someone has experienced something similar. AWS has closed the nodes hot threads API, so we can't get anything from there.

CPU usage over the same period shown in the graph above (the spike corresponds to that period; 17:00 to 17:30 here):

[graph: CPU usage]

Same, but for JVM memory pressure:

There are a few recent threads on similar issues (including 2.X -> 5.X upgrades), so searching the forum will give you a few ideas 🙂 But things like the translog changes will likely impact you.
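
For context on the translog point: since 2.0 the default durability is to fsync the translog on every request, whereas 1.x fsynced asynchronously every 5 seconds. A rough sketch of the settings involved is below; whether AWS managed ES lets you change them is another matter, and the endpoint/index names are placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["https://example-es-endpoint:443"])  # placeholder endpoint

# Sketch only: move the translog back to async fsync, closer to 1.x behaviour.
# Both settings are dynamic index settings in 5.x.
es.indices.put_settings(
    index="our_index",
    body={
        "index.translog.durability": "async",
        "index.translog.sync_interval": "5s",
    },
)
```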

Have you considered increasing your bulk request sizes? Something along these lines with the Python bulk helper, for example (the documents, index name and thresholds below are only illustrative):
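
```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["https://example-es-endpoint:443"])  # placeholder endpoint

docs = [{"field_a": "value", "field_b": 42}]  # placeholder documents (~1.6 KB each in your case)

def generate_actions(documents):
    """Turn documents into bulk index actions (index/type names are placeholders)."""
    for doc in documents:
        yield {"_op_type": "index", "_index": "our_index", "_type": "doc", "_source": doc}

# helpers.bulk batches the actions; raising chunk_size / max_chunk_bytes sends
# fewer, larger bulk requests than the current ~50-document batches.
helpers.bulk(
    es,
    generate_actions(docs),
    chunk_size=500,                     # documents per bulk request
    max_chunk_bytes=10 * 1024 * 1024,   # ~10 MB cap per request
)
```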

Have you raised this with AWS support as well?

We just tried increasing the bulk size to ~4 MB per request, and unfortunately it did not help. I will try raising it with AWS support. Do you have any other suggestions on things to try?

What's interesting is that without any load, aggregations and simple filtered queries are much faster on this new version than on 1.5.

OK, we found something interesting. The number of documents we ask ES to return has a bigger impact on 5.5 than it does on 1.5.

For the first request (nothing cached), retrieving 4.5k documents (all our filtered queries request this size) is up to 2x faster on 1.5 than on 5.5.
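
A sketch of how this can be measured, running the same filtered query at increasing sizes on both clusters and comparing the server-side "took" time (endpoint, index and field names are placeholders, not our real ones):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["https://example-es-endpoint:443"])  # placeholder endpoint

query = {"bool": {"filter": [{"term": {"status": "active"}}]}}  # invented filter

for size in (100, 500, 1000, 2000, 4500):
    resp = es.search(
        index="our_index",
        body={"size": size, "_source": ["field_a"], "query": query},
        request_cache=False,  # keep the request cache out of the comparison
    )
    print(size, resp["took"], "ms")
```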

Any thoughts/recommendations on what we could do to lower response times while still returning this many documents?

Can you break down a few things:

  • Infrastructure
  • Any non-default settings you've applied
  • The queries/aggs
  • Mappings may be helpful

Also, can you try this on a fully functional version of Elasticsearch? Elastic Cloud for example, or even an on-prem cluster, just to rule out the platform.

Thanks for getting back to me, Mark.

Unfortunately, due to time and resource constraints we cannot allocate additional time to this research. We will definitely get back to it once we have more time (and try a bare-metal cluster instead of AWS hosted).

For the time being, we are staying with 1.5, as it gives us the performance we need.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.