Elasticsearch has been a core part of the infrastructure where I work for a few years. We use the AWS managed version, which unfortunately exposes few configuration options and keeps several APIs closed.
Recently we tried switching one of our services, which had been using ES 1.5 for some years, to ES 5.5 (also AWS managed). We were expecting at least some improvement out of the box, without any configuration beyond what we had already done for the ES 1.5 cluster.
However, we found that it cannot hold our production load: response times skyrocketed every time we tried to switch (we tried different scenarios/index configurations). Here is a sample graph depicting the response time uplift we got while the new cluster was enabled:
This shows 99th, 95th, 50th (median) percentiles and average response times.
Some details that describe our use case:
- we make ~400 bulk-indexing-requests/minute. Each of these contain at most 50 documents (each of them around 1.6KB)
- the cluster gets ~4000 requests/minute, combining filtered search queries (this service does not do any full-text search) and aggregations (no nested/pipeline aggregations, just simple counts of values)
- the cluster has just one index, with ~30M documents in it. The schema does not have any nested objects
- 5 nodes, 5 primary shards with 4 replicas each
- our queries can get up to 4500 hits at once, and we use source filtering to just retrieve the parts of the documents we're interested in
- in terms of hardware, we're using m3.2xlarge.elasticsearch instances, with SSD storage (the default one for these).
- we have an index refresh interval of 30 minutes
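To make the indexing side of this concrete, here is a minimal sketch of how one of our bulk requests is shaped, i.e. the NDJSON body sent to the `_bulk` endpoint. The index name, type, and document fields are hypothetical placeholders, not our actual schema:

```python
import json

def build_bulk_body(docs, index="my-index", doc_type="doc"):
    """Build the NDJSON body for a _bulk request: one action line
    plus one source line per document, with a trailing newline."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type, "_id": doc["id"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

# Each of our bulk requests carries at most 50 documents of ~1.6KB each;
# we send ~400 such requests per minute.
docs = [{"id": i, "field": "value"} for i in range(50)]
body = build_bulk_body(docs)
```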
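And on the query side, our requests look roughly like this (ES 5.x syntax): a non-scoring `bool`/`filter` search combined with a `terms` aggregation for value counts, plus `_source` filtering so only the fields we use come back. The field names here are made up for illustration:

```python
import json

# Sketch of a typical request: filter context (no scoring), a terms
# aggregation for simple value counts, and _source filtering. Field
# names are hypothetical.
query = {
    "size": 4500,                       # our queries can return up to 4500 hits
    "_source": ["field_a", "field_b"],  # retrieve only the parts we need
    "query": {
        "bool": {
            "filter": [
                {"term": {"status": "active"}},
                {"range": {"created_at": {"gte": "now-1d"}}},
            ]
        }
    },
    "aggs": {
        "counts_by_status": {"terms": {"field": "status"}}
    },
}
body = json.dumps(query)
```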
This more or less describes the setup. One thing we tried that did make a difference in bringing response times down was disabling the query cache and the request cache (the 1.5 cluster has its query cache disabled and we have not had any issues with it).
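For reference, the index settings involved look roughly like this, using the ES 5.x setting names. Note that `index.queries.cache.enabled` is a static setting (it can only be set at index creation or on a closed index), while the request cache flag and the refresh interval are dynamic:

```python
import json

# ES 5.x index settings: 30-minute refresh interval, both caches disabled.
settings = {
    "index": {
        "refresh_interval": "30m",       # dynamic
        "requests.cache.enable": False,  # shard request cache, dynamic
        "queries.cache.enabled": False,  # query cache, static
    }
}
body = json.dumps(settings)
```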
I am posting this here in case someone has hints on how to get to the source of this issue, or has experienced something similar. AWS has closed the nodes hot threads API, so we can't get anything from there.
CPU usage over the same period shown in the graph above (the spike corresponds to that period; 17:00 to 17:30 here):
Same, but for JVM memory pressure: