Long story short: last year we tried to migrate from Elasticsearch 1.7 to 2.4, but we noticed quite significant performance degradation (details in this thread). We tried all the suggested changes but were not able to match ES 1.7 performance for our queries/traffic. That said, we were clearly abusing some ES functionality: we were using a terms filter with a massive number of terms, and really complex nested filters. We removed these -- replaced the massive terms filter with a simple boolean document flag plus a single term filter, and flattened our "nested structures". After that we decided to try a second time, this time with the latest ES 5.x (5.2.2 at the time). We aligned all our tooling to the new ES APIs (mappings, settings, clients, custom scorer plugin, etc.) and load tested a 5-node cluster with recorded live traffic.
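For context, the terms-filter rewrite looked roughly like this (a minimal sketch with hypothetical field names `allowed_ids` / `is_allowed`; query bodies shown as Python dicts):

```python
# Sketch of the query simplification (field names are hypothetical).

# Before: a terms filter carrying a massive list of IDs on every request.
old_query = {
    "query": {
        "bool": {
            "filter": [
                # tens of thousands of terms shipped per query
                {"terms": {"allowed_ids": list(range(50000))}}
            ]
        }
    }
}

# After: membership is precomputed at index time as a boolean document
# flag, so the query shrinks to a single cheap term filter.
new_query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"is_allowed": True}}
            ]
        }
    }
}
```

The index-time cost of maintaining the flag is paid once per document update instead of on every query.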
Unfortunately, we still see ES 1.7 performing better (much better) for our workload. We can easily handle 375 req/s with a 5-node ES 1.7 cluster, but we struggled to serve 100 req/s with a same-size 5-node ES 5.2 cluster. We also noticed that ES 5.x takes much longer to warm up properly and is much more sensitive to sudden increases in request rate.
To solve the performance issue we tried the following tweaks:
- We tuned all configuration parameters according to the documentation: we set a proper heap size, etc.
- We observed high query cache miss rates (hard to compare to ES 1.7, since it doesn't expose this metric) and wrote a custom caching strategy that replicates our old, hand-tuned caching. It improved caching, but it didn't improve performance. Our guess is that filter caching is not the bottleneck.
- We backported an old hybrid store mechanism from Elasticsearch 1.7. No difference.
- We tried preloading various data into the filesystem cache (…)
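The cache-miss observation came from the node stats API; a minimal sketch (assuming the standard `indices.query_cache` section returned by `_nodes/stats`) of how the hit ratio can be computed from one node's stats:

```python
def query_cache_hit_ratio(node_stats: dict) -> float:
    """Compute the query cache hit ratio from one node's stats
    (the `indices.query_cache` section of `_nodes/stats`)."""
    qc = node_stats["indices"]["query_cache"]
    hits, misses = qc["hit_count"], qc["miss_count"]
    total = hits + misses
    return hits / total if total else 0.0

# Illustrative numbers only (not our real stats): 1200 hits vs 8800
# misses gives a 12% hit ratio, i.e. the cache misses far more often
# than it hits.
sample = {"indices": {"query_cache": {"hit_count": 1200, "miss_count": 8800}}}
```

Tracking this ratio per node during a load test makes it easy to see whether a caching change actually moves the needle.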
Most system metrics look much better on the 5.x cluster (fewer disk reads, thanks to a smaller heap and better filesystem cache utilization); only load and latencies look worse.
We see that the most problematic queries for the new ES cluster are those formed from user input that results in many tokens (in the most extreme case, some clients send double URL-encoded strings, which tokenize into many small tokens -- single letters and numbers), but we don't know what the real bottleneck is.
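To illustrate the token explosion: a double URL-encoded string is full of percent signs and hex digits, so a standard-analyzer-style tokenization shreds it into many tiny letter/number fragments. A small sketch using only the Python stdlib (the crude regex split is just a stand-in for the analyzer):

```python
import re
from urllib.parse import quote

def crude_tokens(text: str) -> list:
    # Rough stand-in for the standard analyzer: lowercase, then split
    # on any run of non-alphanumeric characters.
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]

original = "café & friends/2017"
once = quote(original)   # single URL-encoding
twice = quote(once)      # clients double-encode by mistake

print(crude_tokens(original))  # a few meaningful tokens
print(crude_tokens(twice))     # many short letter/number fragments
```

Each extra fragment is one more term lookup and one more clause to score, which is why these requests are disproportionately expensive.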
- queries sample (two tab-separated columns of compact JSON, ES 1.7 on the left and 5.2 on the right)
- ES 1.7 hot threads
- ES 5.2 hot threads
Thanks in advance for any ideas on how to improve our ES 5.2 cluster's performance.