Performance degradation after ES upgrade (v1.7 -> v2.4)

We recently did a test migration of one of our Elasticsearch clusters from 1.7.3 to 2.4.0, which involved changes in pretty much every component of our core search infrastructure: system provisioning, indexing pipeline, querying layer and custom scoring plugin. We adjusted our config, mappings (unified fields across different types) and query generation (replaced the already deprecated facets with aggregations, and removed the hand-tuned execution mode from range filters).

To give you a bit more insight into our setup, some numbers:

  • we use 5 shards, with 7 replicas distributed among 20 nodes, which gives 2 shards per node
  • we have over 480M documents indexed, which makes about 200M documents and 43GB of data stored in each node
  • we run one ES node per physical machine (16 cores, 32 threads total, SSD, 64GB of RAM) and give 25GB to ES heap (we used to give 30GB to ES 1.7, but ES 2.x seems to use less heap).
  • the rest of our setup is IMHO pretty standard

After initial load tests we noticed an increased average response time in the ES 2.4 cluster. Looking at the differences between our ES 1.7 and 2.4 clusters, we noticed that:

  • ES 2.4 does not load fielddata on the fields where we defined it (we later enforced this with fielddata.loading: eager)
  • ES 2.4 decides by itself what to cache and what not; its query_cache is much smaller than the filter_cache in our ES 1.7 cluster (which we had previously set to 20%). There are very few evictions but a lot of cache misses (although this one is hard to compare, since ES 1.7 does not seem to expose filter_cache misses in its stats).
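For reference, this is roughly how we enforce eager fielddata loading in the ES 2.x mapping; a minimal sketch, where the index, type and field names are placeholders:

PUT my_index/_mapping/my_type
{
  "properties": {
    "my_field": {
      "type": "string",
      "index": "not_analyzed",
      "fielddata": {
        "loading": "eager"
      }
    }
  }
}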

We also decided to try mmapfs in our ES 2.4 cluster, since the memory we leave to the OS (39GB) is close to our node index data (43GB).
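In ES 2.x the store type can be set per index at creation time (or node-wide in elasticsearch.yml); a sketch with a placeholder index name:

PUT my_index
{
  "settings": {
    "index.store.type": "mmapfs"
  }
}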

We run our load tests by replaying live traffic (which we previously recorded) at the rate the other cluster receives at our peak traffic time (~1200 reqs/s), and then compare the metrics we record. We observed the following:

  • ES 2.4 cluster gives ~30% longer response time (mean and upper 90th percentile, as defined in StatsD)

  • Increase of hard timeouts: in addition to our main index, we have another index that we query using a large terms filter (or ids filter) with hundreds of thousands of items, together with a filtered alias. Querying this index seems to cause so much work in the cluster that some requests start taking longer than 1.1s, which our calling component considers a “hard timeout”. This happens despite the fact that we set the timeout value for best-effort responses to 400ms.

  • All other system metrics (load, wait IO, disk reads, GC pauses) look better in ES 2.4 cluster.
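For context, the large terms filter from the hard-timeout case above lives in a filtered alias, which can be set up roughly like this (index, alias and field names are made up; the real filter carries hundreds of thousands of terms):

POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "big_index",
        "alias": "big_index_subset",
        "filter": {
          "terms": {
            "item_id": ["id-1", "id-2", "id-3"]
          }
        }
      }
    }
  ]
}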

Thanks in advance for any ideas on how to solve this.


It would help if you could tell more about this performance degradation:

  • Did you reindex the data, or is it a 1.7 index that was imported in 2.4? If it was reindexed, does the index have similar disk usage?
  • Is there a common structure to your queries? If yes, could you share it?
  • Could you try to capture hot threads while your load test is running so that we can try to spot the bottlenecks? Ideally you would capture them e.g. 5 times at 10-second intervals.

we have another index that we query using a large terms filter (or ids filter) with hundreds of thousands of items, together with a filtered alias. Querying this index seems to cause so much work in the cluster that some requests start taking longer than 1.1s, which our calling component considers a “hard timeout”. This happens despite the fact that we set the timeout value for best-effort responses to 400ms.

This is a known issue, unfortunately: large terms queries perform lots of lookups in the terms dictionary, and Elasticsearch only starts checking the timeout after all terms have been looked up.

we use 5 shards, with 7 replicas distributed among 20 nodes, which gives 2 shards per node

With 2.x we switched from async translog fsyncs to a per-request fsync. This mainly eliminates the 5-second window of potential data loss in the translog, which is IMO not important if you run with that many replicas. If you switch it off you might see some improvement in indexing, which can result in a potentially different number of segments; that in turn might have quite an impact on query performance. I wonder if you have a similar number of segments per shard in 1.7 and 2.4? Do you optimize your index down to 1 segment when you run your perf tests?
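If you want to try the old behaviour, translog durability can be switched back dynamically (index name is a placeholder):

PUT my_index/_settings
{
  "index.translog.durability": "async"
}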

Yes, we did a full reindex.
Disk reads are actually much better in the ES 2.4 cluster, since we reduced its heap and gave more memory to the OS. And since we switched to mmapfs I see hardly any disk reads in the ES 2.4 cluster.
We constantly index new documents into our clusters ("streamfeeding"), and since ES 2.x switched to per-request fsync, disk writes were actually worse (more of them) in the ES 2.4 cluster. That said, I've just switched index.translog.durability = async as @s1monw suggested and now the disk writes are on par.

Yes, we use a filtered query with a custom function_score (custom scorer plugin) and various filters (mostly term and range filters, with user-defined values). The query part is common and looks like this.

Here you go.

I changed translog durability to async and noticed decrease of disk writes but not much improvement on query performance.

No, I call optimize with default params (without specifying max_num_segments) just before my load tests. We also want our tests to reflect real cluster usage, so we "streamfeed" it at the same rate as the other, online cluster (~35 docs/s).

Thanks. The hot threads are interesting, since Elasticsearch seems to spend significant time in operations that only run once per segment, like creating weights (which requires looking up terms in the terms dictionary in order to read index statistics) and interacting with the filter cache. The other thing that seems to take time is DiscoRankScorer.docKind/docId. On the other hand, very little time seems to be spent reading postings lists and combining them (e.g. conjunctions/disjunctions).

There are a few changes I am thinking of that could make things a bit better, like LUCENE-7311 (so that cached term filters do not seek the terms dictionary) or LUCENE-7235/LUCENE-7237 (to prevent interactions with the cache from becoming bottlenecks), but unfortunately they will only be available in 5.0.

Am I correct that most search requests match few documents and return very quickly (i.e. in a couple of milliseconds)? That would explain why the hot threads suggest that more time is spent getting ready to search than actually searching. As a side note, that would also mean you should be able to reach better throughput with fewer shards, since we would run fewer of these once-per-segment operations (however, it would hurt latency a bit since queries would be less parallelized).

When you were on 1.7, were you already using doc values on the fields used by the DiscoRankScorer.docKind/docId methods, or in-memory fielddata?

If that is something that is easy for you to test, it would be interesting to see what happens with response times if you disable the cache entirely by putting the undocumented index.queries.cache.type: none in the index settings. The index will need to be reopened for it to be taken into account; please also run a couple of queries and check that the filter cache stats remain at 0 to be sure it took effect:

POST index/_close

PUT index/_settings
{
  "index": {
    "queries.cache.type": "none"
  }
}

POST index/_open

Yeah, I think that's it. The Discorank class from our scoring plugin extends AbstractDoubleSearchScript, and the docKind/docId methods call doc().get(fieldname) with the "_type" and "id" field names respectively. I've just looked at _cat/fielddata and indeed, these fields are there in the ES 1.7 cluster but not in ES 2.4.

As I understand it, doc values are quite fast (mmapped) but still not as fast as fielddata (local memory), and in situations where they are used that often (two lookups for each matched document) that might make the difference.

The only change we made to the plugin was to align it with the new ES API. The mentioned fields never had fielddata defined in their mapping, so I assume ES 1.x used to build fielddata for these fields automatically (?).

I can enforce eager loading of fielddata for the "id" field, but AFAIK ES 2.x forbids messing with special fields like "_type". Should I copy it to another field then?
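One way to sketch that (assuming a reindex is acceptable): add a regular not_analyzed string field, e.g. a "kind" field populated at index time with the type name, so the script reads it through ordinary doc values instead of _type; index and type names here are placeholders:

PUT my_index/_mapping/my_type
{
  "properties": {
    "kind": {
      "type": "string",
      "index": "not_analyzed",
      "doc_values": true
    }
  }
}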

I also noticed a bug in the _cat/fielddata API. Each fielddata entry has an "id" property (first column), and when there is an actual "id" field that has fielddata, weird things happen.

I'll enforce fielddata loading on "kind" and "id" tomorrow; we'll see if that helps.

I think the main problem here could be the _type field. One major difference between doc values and fielddata is that doc values optimize access to ordinals but assume that it is OK for the ordinal -> value lookup to be slow: they assume that algorithms can work on ordinals, and take advantage of the fact that this lookup can be slow in order to perform better compression. When you get the value of the _type field through the script, it first looks up the ordinal and then the value.

I'm wondering whether it could be changed to work on ordinals instead? If the number of types is small, it could actually load them in memory once per segment and then only read ordinals for each document. That would probably be a significant change to your script, so I'm wondering whether we might be able to assess the impact of this change without actually doing it, by just hardcoding the value of the _type field in the script and seeing how performance goes.

That might be a good idea. For the record, types are going to be removed in the long term: Remove support for types? · Issue #15613 · elastic/elasticsearch · GitHub

Correct. It builds fielddata by "uninverting" the inverted index in memory.

If all the fields you use already have doc values, I don't think loading fielddata eagerly would help much. Eager loading is more useful for in-memory fielddata, since uninversion can be a costly operation.

You might also consider using some kind of type ordinal as an additional numeric field that can be used instead of the type itself. This would avoid the entire ordinal dance and give you better lookup performance. I assume you use the type to disambiguate the ID in your scoring plugin?
If possible, I guess you could also add a second numeric field with doc_values only (not indexed, etc.) for the ID, which would allow you to do something like score(doc.get("type_id"), doc.get("numeric_id")), so that you don't need to do any string conversions or lookups at all.
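A minimal mapping sketch of that idea (the type_id/numeric_id field names come from the example above; both values would be assigned at index time, and index/type names are placeholders):

PUT my_index/_mapping/my_type
{
  "properties": {
    "type_id": {
      "type": "integer",
      "index": "no",
      "doc_values": true
    },
    "numeric_id": {
      "type": "long",
      "index": "no",
      "doc_values": true
    }
  }
}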

Hey, sorry for the slow feedback.

Maybe it's related (or maybe not), but we found a bug in our scoring plugin (we had needScores=false, which disabled document scores that we actually need). It's fixed now.

I've done that. I ran another round of load tests and I don't see much improvement.

Response time of ES 1.7 cluster:

Response time of ES 2.4 cluster:

Here are the latest hot threads.

I'll try that and post the results here.