Cache Impacts Performance

Hello,

We launched our product 10 months ago. Now it's time for us to expand, serve more customers and, as a result, move to a bigger cluster. We are therefore currently running a sizing procedure.

Our sizing procedure was as follows (see attached illustration):

  1. Write data from all of our clients (instead of one) to a new cluster for one week; the new index has the same mapping as our production index. This mainly let us measure how much data we will really index per week. That matters to us because today we have one big index to rule them all, and in the new cluster we are going to refactor our system to support writing to and reading from periodic indices.
  2. Write the data again, this time after some modifications: an appropriate number of shards (8), more cores/RAM (see the index-settings sketch after this list).
  3. After all the data is written, measure query performance on both the production index and the new index, to verify that performance stays the same despite the growth in data.
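
To make step 2 concrete, here is a minimal sketch of creating the new index with 8 primary shards; the index name and replica count are illustrative, not our real values:

    PUT /weekly-index-000001
    {
      "settings": {
        "number_of_shards": 8,
        "number_of_replicas": 1
      }
    }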

Conclusions:
We found out that caching is a critical aspect of performance.
Here's the full story, buckle up!

  • After indexing for one week, we started running some queries to measure our performance. Query performance was very poor: 4 seconds instead of ~100-200ms (!). It was shocking, as our new cluster is a beast in terms of specs compared to our production cluster, and the new index is even smaller than the production index. In addition, the run times weren't consistent; we noticed spikes in performance: sometimes a query took 4 seconds and sometimes the SAME query hit the desired performance goal of ~100-200ms.
  • After some investigation, we managed to find a possible explanation. We found that after running the same query 3 times, there was a dramatic performance boost for all the following runs. The boost wasn't only for the exact query: if we replaced some value parameters with others (but kept exactly the same query structure), performance was still great. That led us to examine Elasticsearch's caching mechanisms.
  • We found that Elasticsearch's caching mechanisms are actually perfect for our use case. Most of our queries run with size: 0, as we are mainly interested in aggregation results. In addition, we only query for exact values (all the queries contain a bool query with filter clauses, combining term and range sub-queries; see the query sketch after this list). This matters because Elasticsearch caches both the full results of search requests (the shard request cache) and the set of documents matching each filter sub-query (the node query cache). The latter cache, in our opinion, plays the greater role. It means that if we know in advance which specific fields are relevant for most of our queries, we can run one single query which contains all of them so they will be cached.
  • After testing, we hit bingo. As the cache grew (we tracked it via the index stats API; see the stats/clear-cache sketch after this list), query performance reached the desired goal. After clearing the cache (via the clear cache API), performance was horrible again. It's also important to mention that the cache size of our production index in the production environment is 2 GB (!). Apparently this is the maximum cache size allowed by default: the default is 10% of the heap, and 10% of our 20 GB heap is 2 GB.
  • That led us to the retrospective understanding that our sizing procedure wasn't very realistic. In real life the index grows from 0 GB to 200 GB, and the cache is built along the way. Even if the cache isn't available at first, it doesn't really matter, as the index is still small at that point. In our sizing procedure, the first queries ran against an index holding 200 GB of data with a 0 GB cache. So a more realistic scenario is to run the test queries while the data is being written (throughout the week, not only at its end).
  • To finally prove our theory, we are going to duplicate the production index to a separate index in our production cluster (via the reindex or snapshot-and-restore API; see the reindex sketch after this list). That will let us run clear cache (which we certainly don't want to run on the index production actually uses) and examine performance without any cache. Hopefully, the performance will be horrible without any cache, which would mean the case is completely closed and understood.
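
To make the caching discussion concrete, here is a minimal sketch of the query shape described above; the index, field names and values are invented for illustration (and on Elasticsearch 7.2+ the date histogram takes calendar_interval instead of interval). Rounded date math such as now-7d/d is also friendlier to the caches than a raw now:

    GET /weekly-index-000001/_search
    {
      "size": 0,
      "query": {
        "bool": {
          "filter": [
            { "term": { "client_id": "client-42" } },
            { "range": { "timestamp": { "gte": "now-7d/d", "lt": "now/d" } } }
          ]
        }
      },
      "aggs": {
        "events_per_day": {
          "date_histogram": { "field": "timestamp", "interval": "day" }
        }
      }
    }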
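
The stats and clear-cache calls we used look like this (index name again illustrative):

    GET /weekly-index-000001/_stats/query_cache,request_cache

    POST /weekly-index-000001/_cache/clear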
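
And the duplication step, sketched with the reindex API; the source and destination names are placeholders (snapshot-and-restore would work just as well):

    POST /_reindex
    {
      "source": { "index": "production-index" },
      "dest":   { "index": "production-index-copy" }
    }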

What's next?
We still want to get more information about the caching mechanism.

  • We need to store data for the last 6 months, which means keeping 26 weekly indices. Will the caches of all indices be kept all the time? Will using more RAM help? Will changing yaml settings (such as indices.queries.cache.size, indices.requests.cache.size) or disabling caching completely on old indices in favor of new ones help (see the settings sketch after this list)?
  • Will it be necessary to "warm" the caches of indices by periodically running a query once in a while?
  • What happens when a node restarts? Is the cache of its shards gone? Will it be restored from replica shards?
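
To make the first question concrete, a sketch of the knobs we mean; the sizes below are just the documented defaults written out explicitly, and the index names are made up:

    # elasticsearch.yml (static, applies per node)
    indices.queries.cache.size: 10%     # node query cache (default: 10% of heap)
    indices.requests.cache.size: 1%     # shard request cache (default: 1% of heap)

    # Dynamic, per index - turn the request cache off for an old index:
    PUT /weekly-index-000001/_settings
    { "index.requests.cache.enable": false }

    # Static, per index - the query cache can only be disabled at index creation:
    PUT /weekly-index-000026
    { "settings": { "index.queries.cache.enabled": false } }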

We've shared some interesting thoughts and will be happy to hear your responses.
Does all this make sense to you? :cowboy_hat_face:
Thank you.

Any help would be much appreciated :slight_smile:.
I do see that it is mentioned here that:

With the request cache disabled, requests to elasticsearch for the dashboard to refresh takes 12s-14s. When the request cache is enabled this drops to ~100ms. The same request is now 100x faster!

First of all, even though it is not related to the caching questions, I would like to point out that running a cluster with only 2 nodes is not recommended, as it does not offer HA. I would recommend adding a third node so you have 3 master-eligible nodes in the cluster: either another master/data node or a smaller dedicated master node.
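If you opt for a small dedicated master, a minimal sketch of the relevant elasticsearch.yml settings, assuming a pre-7.x cluster (newer versions use node.roles instead and manage the voting quorum themselves):

    # elasticsearch.yml on the dedicated master node
    node.master: true
    node.data: false
    node.ingest: false

    # on every master-eligible node: majority of 3 master-eligible nodes
    discovery.zen.minimum_master_nodes: 2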

As long as an index is being written to, the cache will be invalidated and have to be rebuilt. As you appear to have quite a large index covering a reasonably long period, you are likely to notice this, since the most recent data is usually the data being queried.

Given the data volumes, it is possible you would benefit from using a daily index with one primary shard per data node (in this case 2). You will get more indices over the 6 months, but the amount of data per node that is queried without efficient caching will shrink. You can also use the rollover index API to create new indices based on a combination of age and/or size. This allows you to get shards of an even size even if the ingested data volume has peaks.
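For example, a minimal rollover sketch; the alias, index name and condition values are illustrative only (max_size requires a version that supports size-based rollover):

    PUT /events-000001
    {
      "aliases": { "events_write": {} }
    }

    POST /events_write/_rollover
    {
      "conditions": {
        "max_age": "1d",
        "max_size": "50gb"
      }
    }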
