We recently did a test migration of one of our Elasticsearch clusters from 1.7.3 to 2.4.0, which involved a lot of changes in pretty much every component of our core search infrastructure: system provisioning, indexing pipeline, querying layer and our custom scoring plugin. We adjusted our config, mappings (unified fields across different types) and query generation (replaced the already-deprecated facets with aggregations, removed the hand-tuned execution mode from range filters).
To give you a bit more insight into our setup, and some numbers:
- we use 5 shards, each with 7 replicas, distributed across 20 nodes, which gives 2 shards per node (index settings sketched below)
- we have over 480M documents indexed, which makes about 200M documents and 43GB of data stored on each node
- we run one ES node per physical machine (16 cores, 32 threads total, SSD, 64GB of RAM) and give 25GB to the ES heap (we used to give 30GB to ES 1.7, but ES 2.x seems to use less heap).
- the rest of our setup is IMHO pretty standard
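For reference, a rough sketch of the index-level settings behind the shard and replica counts above (the index name and host are placeholders):

```
# Rough sketch of our index settings (index name and host are placeholders);
# 5 primaries, each with 7 replicas -> 40 shard copies over 20 nodes = 2 per node
curl -XPUT 'localhost:9200/our_index' -d '
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 7
  }
}'
```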
After initial load tests we saw increased average response times in the ES 2.4 cluster. Looking at the differences between the ES 1.7 and ES 2.4 clusters, we noticed that:
- ES 2.4 does not load fielddata for the fields where we had defined it (we later enforced it with `fielddata.loading: eager`; mapping sketch below)
- ES 2.4 decides by itself what to cache and what not; its `query_cache` is much smaller than the `filter_cache` in our ES 1.7 cluster (which we had previously set to 20%). There are very few evictions but a lot of cache misses (this one is hard to compare, though, since ES 1.7 does not seem to expose `filter_cache` misses in its stats).
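For the fielddata point above, this is roughly how we enforced the eager loading (index, type and field names below are placeholders for our real ones):

```
# Rough sketch of forcing eager fielddata loading on a field in ES 2.x
# (index, type and field names are placeholders)
curl -XPUT 'localhost:9200/our_index/_mapping/our_type' -d '
{
  "properties": {
    "category_id": {
      "type": "string",
      "index": "not_analyzed",
      "fielddata": { "loading": "eager" }
    }
  }
}'
```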
We also decided to try `mmapfs` in our ES 2.4 cluster, since the memory we leave to the OS (39GB) is close to the index data per node (43GB).
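Switching the store type is a one-line static setting; the sketch below assumes it is applied node-wide via elasticsearch.yml (in ES 2.x it can also be set per index at creation time):

```
# elasticsearch.yml: use mmapfs instead of the default hybrid store (ES 2.x)
index.store.type: mmapfs
```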
We run our load tests by replaying previously recorded live traffic at the rate the live cluster receives at our peak traffic time (~1200 req/s), and then compare the metrics we record. We observed the following:
- the ES 2.4 cluster gives ~30% longer response times (mean and upper 90th percentile, as defined in StatsD)
- an increase in hard timeouts: in addition to our main index, we have an index that we query through a filtered alias with a large terms (or ids) filter containing hundreds of thousands of items (query shape sketched below). Querying this index seems to cause so much work in the cluster that some requests start taking longer than 1.1s, which our calling component considers a "hard timeout". This happens despite the fact that we set the `timeout` value for best-effort responses to 400ms.
- all other system metrics (load, IO wait, disk reads, GC pauses) look better in the ES 2.4 cluster.
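To make the hard-timeout case above more concrete, here is a rough sketch of the query shape, assuming the terms filter sits in the query body sent against the filtered alias (alias, field name and values are placeholders; the real filter has hundreds of thousands of ids):

```
# Rough sketch of the problematic query shape against the filtered alias
# (alias, field name and values are placeholders; the real terms filter
# contains hundreds of thousands of ids)
curl -XPOST 'localhost:9200/our_filtered_alias/_search' -d '
{
  "timeout": "400ms",
  "query": {
    "bool": {
      "filter": {
        "terms": { "item_id": ["1", "2", "3"] }
      }
    }
  }
}'
```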
Thanks in advance for any ideas on how to solve this.