Hello,
We launched our product 10 months ago. It's now time for us to expand, serve more customers, and as a result move to a bigger cluster, so we have recently been working on a sizing procedure.
Our sizing procedure was as follows (see attached illustration):
- Write data from all of our clients (instead of just one) for one week to a new cluster; the new index has the same mapping as our production index. This mainly let us measure how much data we will really index per week. That matters to us because today we have one big index to rule them all, and in the new cluster we are going to refactor our system to write to and read from periodic indices.
- Write the data again, this time after some modifications: a more appropriate number of shards (8) and more cores/RAM (the index creation with the new shard count is sketched after this list).
- After all the data is written, measure query performance on both the production index and the new index, to verify that performance stays the same despite the growth of the data.
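For reference, here is a minimal sketch of how one of the new weekly indices might be created with the revised shard count. The index name and replica count are hypothetical placeholders, and the mapping (identical to the production one) is omitted:

```
# Hypothetical sketch: create one of the new weekly indices with 8 primary shards
PUT /events-week-02
{
  "settings": {
    "number_of_shards": 8,
    "number_of_replicas": 1
  }
}
```

In practice we would probably put the mapping and these settings into an index template that matches the weekly index name pattern, so each periodic index is created consistently.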
Conclusions:
We found out that caching is a critical aspect of performance.
Here's the full story, buckle up!
- After indexing for one week, we started running some queries to measure our performance. The query performance was very poor: 4 seconds instead of ~100-200ms (!). It was shocking, as our new cluster is a beast in terms of specs compared to our production cluster, and the new index is even smaller than the production index. In addition, the run times weren't consistent; we noticed spikes in performance: sometimes a query took 4 seconds, and sometimes the SAME query hit our desired goal of ~100-200ms.
- After some investigation, we managed to find a possible explanation. We found that after running the same query 3 times, there was a dramatic performance boost for all the following runs. The boost wasn't limited to that exact query: if we replaced some of the parameter values with others (but kept exactly the same query structure), the performance was still great. That led us to examine Elasticsearch's caching mechanisms.
- We found out that Elasticsearch's caching mechanisms are actually a perfect fit for our use case. Most of our queries run with `size: 0`, as we are mainly interested in aggregation results. In addition, we only query for exact values (all the queries contain a `bool` query in filter context, with nested filter and `range` sub-queries). This matters because Elasticsearch caches both the full results returned for the request (the shard request cache) and the documents that matched each filter sub-query (the node query cache). The latter cache, in our opinion, plays the greater role: if we know in advance which specific fields are relevant for most of our queries, we can run one single query that contains all of them so they will all be cached.
- After testing, we hit a bingo. As the cache size grows (we tracked it via the index stats API), query performance reaches the desired goal, and after clearing the cache (via the clear cache API) the performance is horrible again; a representative query and the stats check are sketched after this list. It's also worth mentioning that the cache size of our production index in the production environment is 2 GB (!). Apparently this is the maximum cache size allowed by default, since the default is 10% of the heap, and 10% of our 20 GB heap is 2 GB.
- That leads us to the retrospective understanding that our sizing procedure wasn't very realistic. In real life the index grows gradually from 0 GB to 200 GB, and the cache gets built along the way; even if the cache isn't there at first, it doesn't really matter, because the index is still small at that point. In our sizing procedure, however, the first queries ran against an index that already held 200 GB of data with a completely cold (0 GB) cache. So a more realistic scenario is to run the test queries while the data is being written (throughout the week, not only at its end).
- To finally prove our theory, we are going to duplicate the production index to a separate index in our production cluster (via the reindex or snapshot-and-restore APIs). That will let us run a clear-cache request on the copy (which we certainly don't want to run on the index production actually uses) and examine the performance without any cache; this step is also sketched after this list. Hopefully, the performance will be horrible without any cache, which would mean the case is completely closed and understood.
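For context, this is roughly the shape of query we benchmark with, plus how we watched the caches grow. The index, field names, and values here are hypothetical placeholders rather than our real schema:

```
# A typical "size: 0", aggregation-only request filtered on exact values
GET /events-week-01/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term":  { "client_id": "client-42" } },
        { "range": { "timestamp": { "gte": "now-7d/d", "lt": "now/d" } } }
      ]
    }
  },
  "aggs": {
    "events_per_day": {
      "date_histogram": { "field": "timestamp", "interval": "day" }
    }
  }
}

# Watch the cache sizes while running the benchmark
GET /events-week-01/_stats/query_cache,request_cache
```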
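And this is the kind of follow-up we have in mind for the copy-and-clear-cache experiment; again, the index names are placeholders, not our real ones:

```
# Copy the production index to a scratch index we are free to experiment on
POST /_reindex
{
  "source": { "index": "production-events" },
  "dest":   { "index": "production-events-copy" }
}

# Drop the caches of the copy only, then re-run the benchmark queries
POST /production-events-copy/_cache/clear
```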
What's next?
We still want to get more information about the caching mechanism.
- We need to store data for the last 6 months, which means keeping 26 weekly indices. Will the caches of all these indices be kept all the time? Will adding more RAM help? Will changing yaml settings (such as `indices.queries.cache.size` and `indices.requests.cache.size`), or disabling caching completely on old indices in favor of new ones, help? (Some of these knobs are sketched after this list.)
- Will it be necessary to "warm" the caches of the indices by running a periodic query once in a while?
- What happens when a node restarts? Is the cache of its shards gone? Will it be restored from the replica shards?
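To make the settings part of the question concrete, these are the knobs we have in mind, with hypothetical values (not recommendations):

```
# elasticsearch.yml (node-level, static settings) - hypothetical values
indices.queries.cache.size: 15%
indices.requests.cache.size: 2%
```

And the dynamic, per-index switch we were thinking of for old indices (the index name is a placeholder):

```
# Disable the shard request cache on an old weekly index
PUT /events-week-01/_settings
{
  "index.requests.cache.enable": false
}
```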
We've shared some of our thoughts here and would be happy to hear your responses.
Does all this make sense to you?
Thank you.