We're looking into maximizing Elasticsearch's performance by making the fullest possible use of memory. We hypothesize that reducing disk I/O by running Elasticsearch entirely in memory could lead to more efficient query execution and lower CPU usage. Are there methods or configurations available to achieve this?
Here are a few specific questions we have:
1. Running Elasticsearch in Memory: Is it possible to run Elasticsearch entirely in memory, minimizing disk I/O and potentially reducing CPU usage and requirements?
2. Configuration for In-Memory Usage: What configuration settings should we consider to optimize Elasticsearch for in-memory usage? Are there specific parameters or settings we should tweak to achieve this goal?
3. Estimating Memory Size based on Index Size: How can we estimate the required memory size based on the size of our indexes/shards? Are there any formulas or best practices for determining the amount of memory needed relative to the size of our data?
Any insights, experiences, or best practices you can share regarding optimizing Elasticsearch for in-memory usage would be greatly appreciated. Thank you!
Elasticsearch typically relies on the operating system page cache for performance, so in order to minimise disk I/O I would recommend the following:
1. Configure the smallest heap size your use case will allow, as this leaves as much memory as possible for the operating system page cache. Ensure the heap is still large enough that you do not suffer from frequent or long GC.
2. Ensure all your indices fit in the operating system page cache, especially if you have a search use case. If you are relying on aggregations and do not need to return documents, I suspect only a subset of the index files may need to fit in the page cache. See the sketch after this list for a rough way to check whether your indices fit.
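This is a minimal sketch, assuming the 8.x Python elasticsearch client, a single node at localhost:9200, and placeholder RAM and heap figures, of how you could compare the total on-disk size of your indices with the memory left over for the page cache:

```python
# Rough sanity check, not a benchmark: sum the on-disk size of all indices
# and compare it with the RAM left for the OS page cache (total RAM minus
# the Elasticsearch heap). Host, RAM, and heap figures are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

TOTAL_RAM_BYTES = 64 * 1024**3   # assumed machine RAM (64 GiB)
HEAP_BYTES = 8 * 1024**3         # assumed -Xms/-Xmx heap size (8 GiB)

# _cat/indices with bytes="b" reports store sizes as plain byte counts.
rows = es.cat.indices(format="json", bytes="b", h="index,store.size")
total_store_bytes = sum(int(r["store.size"]) for r in rows if r.get("store.size"))

page_cache_budget = TOTAL_RAM_BYTES - HEAP_BYTES
print(f"Total index store size: {total_store_bytes / 1024**3:.1f} GiB")
print(f"Approximate page cache budget: {page_cache_budget / 1024**3:.1f} GiB")
if total_store_bytes <= page_cache_budget:
    print("Indices should fit in the page cache.")
else:
    print("Indices are larger than the page cache budget.")
```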
In our case we have heavy write (and also read) operations. We are streaming and sinking data into Elasticsearch for product updates, of which we have millions, so writing to disk is costly for us. We are trying to find a faster alternative, which is why I created this topic. Thanks for your suggestion @Christian_Dahlqvist, but it doesn't cover the writing part, right? Your approach would free up more room for caching and may improve read operations.
Correct. Heavy writing will also affect the page cache, so my advice is generally geared towards read-heavy use cases.
Writes must be persisted to disk, so it is hard to work around this. Have you gone through the official guidance and tried to optimise how you index data? I have seen questions around using some kind of ramdisk for storage, but I think this ran into various issues and I have never seen it used in practice.
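For reference, here is a hedged sketch of the kind of changes that guidance describes for write-heavy loads: bulk requests plus relaxing the refresh interval while the load runs. It assumes the 8.x Python client and an existing index; the index name and the fetch_product_updates() generator are placeholders for your own product-update stream, not anything from your setup:

```python
# Sketch of indexing-speed tuning for a heavy product-update load:
# bulk requests plus a relaxed refresh interval while the load runs.
# Assumes the target index already exists; names below are placeholders.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
INDEX = "product-updates"  # assumed index name


def fetch_product_updates():
    """Placeholder for your own stream of product documents."""
    yield {"id": "1", "name": "example product", "price": 9.99}


# Disable periodic refreshes while bulk-loading, then restore them.
es.indices.put_settings(index=INDEX, settings={"index": {"refresh_interval": "-1"}})
try:
    actions = (
        {"_index": INDEX, "_id": doc["id"], "_source": doc}
        for doc in fetch_product_updates()
    )
    helpers.bulk(es, actions, chunk_size=1000)
finally:
    es.indices.put_settings(index=INDEX, settings={"index": {"refresh_interval": "1s"}})
    es.indices.refresh(index=INDEX)
```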
Yes, we have already checked the official guidance. We need something more radical.
Can you please share a link to the ramdisk storage discussions? It sounds interesting, and risky.
I do not have any links as I have never seen anyone do this successfully. I have only seen this mentioned a few times and always associated with problems. You probably need to search this forum and the internet.
I would instead focus on optimizing how you index data. If you can share information about your use case, your data, your transaction volumes and the specification of your cluster here, someone may be able to help you identify improvements.