I have a 6 node cluster setup (3 x data nodes and 3 x master nodes). The cluster is running 6.8.0 on each node with x-pack enabled for SSL communication and security.
Each node is running on Ubuntu 18.04.2 LTS
Each data node runs on a VM with 57GB of memory and a heap size of 42GB (i.e. Xms42g & Xmx42g).
Each master node also runs on a VM, with 32GB of memory and a heap size of 21GB.
Currently my cluster has 322 indexes, each with 6 primary shards and a single replica per shard (so each index has 12 shards in total, primary plus replica).
The indexes are made up of 10 main index series which are rolled over based on conditions such as age, size and total documents, with an alias used to query all indexes in a rolled series. Some indexes roll very often (based on size) while others roll when the age limit comes around. Some indexes are only a few MB in size, while others are around 120GB.
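For context, the rollover calls look roughly like the sketch below; a minimal example with Python and `requests` against the `_rollover` API, assuming a write alias per series. The alias name, thresholds, host and credentials here are placeholders, not the real values:

```python
import requests

ES = "https://es-data-1:9200"           # placeholder node address
AUTH = ("elastic", "changeme")          # placeholder credentials
CA = "/etc/elasticsearch/certs/ca.crt"  # placeholder CA for the x-pack SSL setup

# Roll the index behind a write alias when it is too old, too big,
# or holds too many documents (thresholds here are only examples).
resp = requests.post(
    f"{ES}/logs-write/_rollover",
    json={
        "conditions": {
            "max_age": "7d",
            "max_size": "50gb",
            "max_docs": 200_000_000,
        }
    },
    auth=AUTH,
    verify=CA,
)
print(resp.json())  # shows which conditions matched and whether a rollover happened
```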
The system performs well when indexing new documents into the cluster. However, when you start searching and querying through the aliases, the system becomes unstable and the data nodes encounter the dreaded "java.lang.OutOfMemoryError: Java heap space" and then go offline. The master nodes, however, seem stable enough.
All my nodes have been set up with mlockall enabled.
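A quick way to double-check that mlockall actually took effect, and what heap each node booted with, is the nodes info API; a minimal sketch with Python and `requests`, with placeholder host and credentials:

```python
import requests

ES = "https://es-master-1:9200"         # placeholder node address
AUTH = ("elastic", "changeme")          # placeholder credentials
CA = "/etc/elasticsearch/certs/ca.crt"  # placeholder CA bundle

# Nodes info: process.mlockall says whether memory locking succeeded,
# jvm.mem.heap_max_in_bytes shows the heap each node actually started with.
info = requests.get(f"{ES}/_nodes/process,jvm", auth=AUTH, verify=CA).json()

for node in info["nodes"].values():
    print(
        node["name"],
        "mlockall:", node["process"]["mlockall"],
        "heap_max:", node["jvm"]["mem"]["heap_max_in_bytes"],
    )
```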
THE QUESTION
Now, clearly, with this number of indexes and shards, more memory is needed; however, I am having a hard time gauging just what amount of memory would be ideal.
Can anyone offer some advice or potentially useful tips to improve the stability?
As Elasticsearch relies on off-heap memory and the OS page cache for performance, it is recommended to assign no more than 50% of available memory to the heap, and to keep the heap below around 30GB.
Dedicated master nodes and coordinating only nodes can have more memory assigned to heap as they do not hold any data.
That is a lot of shards per node. What is the average shard size? If you look at the output of the cluster stats API, how much memory is used by segments in the cluster?
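To pull those numbers, something like the following reads the shard count and segment memory out of the cluster stats API; same placeholder host and credentials as above:

```python
import requests

ES = "https://es-master-1:9200"         # placeholder node address
AUTH = ("elastic", "changeme")          # placeholder credentials
CA = "/etc/elasticsearch/certs/ca.crt"  # placeholder CA bundle

stats = requests.get(f"{ES}/_cluster/stats", auth=AUTH, verify=CA).json()

indices = stats["indices"]
shards = indices["shards"]["total"]
segments = indices["segments"]
print("total shards:       ", shards)
print("total segments:     ", segments["count"])
print("segment memory (GB):", segments["memory_in_bytes"] / 1024 ** 3)
print("avg shard size (GB):",
      indices["store"]["size_in_bytes"] / shards / 1024 ** 3)
```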
I would also recommend watching this webinar, which talks about optimizing storage for heap usage, and looking at your mappings to try to optimize them. Are you using nested documents and/or parent-child relationships, which can drive up heap usage?
Version 7 is better at avoiding OutOfMemoryErrors, instead rejecting more requests when under memory pressure to keep the nodes alive. Can you upgrade?
Also +1 to what @Christian_Dahlqvist says: your heap size config doesn't follow the documented guidelines at all. Your heap size must not exceed 50% of your available RAM, if only to leave the other 50% for direct buffers. Moreover, heaps over 32GB may effectively have less usable space than slightly smaller heaps because pointer compression (compressed oops) is disabled. You may get better performance with a smaller heap, because smaller heaps are cheaper to GC and because this leaves more space for the filesystem cache on which Elasticsearch relies heavily.
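On the 32GB point, the nodes info API reports whether the JVM is still using compressed object pointers, so you can verify it directly; a quick check, again with placeholder host and credentials:

```python
import requests

ES = "https://es-master-1:9200"         # placeholder node address
AUTH = ("elastic", "changeme")          # placeholder credentials
CA = "/etc/elasticsearch/certs/ca.crt"  # placeholder CA bundle

# A 42GB heap is past the compressed-oops threshold (~32GB), so this
# would be expected to report false on the data nodes until the heap is reduced.
jvm = requests.get(f"{ES}/_nodes/jvm", auth=AUTH, verify=CA).json()
for node in jvm["nodes"].values():
    print(
        node["name"],
        "heap_max:", node["jvm"]["mem"]["heap_max_in_bytes"],
        "compressed oops:", node["jvm"]["using_compressed_ordinary_object_pointers"],
    )
```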
However, it's surprising that you are hitting an OutOfMemoryError as soon as you start to search. As well as Christian's questions about your mappings, could you share the specifics of the searches you are doing? Are you performing unreasonably large aggregations, for instance?
Thanks @Christian_Dahlqvist and @DavidTurner for your responses.
I will start by bringing the allocated heap back to no more than 50% (i.e. 27 GB) and run some tests. I may be able to perform a rolling upgrade to 7 but not for the next couple of weeks.
The mappings have been optimized to ensure specific field typing is used, as have the indexers. I also ensure each index only has a single document type and don't sub-type within a single index.
@Christian_Dahlqvist I will also come back with some stats from the cluster stats API.