Hello, we're seeing memory issues in our cluster: data nodes sometimes crash after running out of memory.
Some quick background on our Elasticsearch cluster, which runs in GKE with the following nodes:
- 3x Master nodes
- roles: master
- max memory 2Gi
- 2x Client nodes
- roles: remote_cluster_client
- java options: -Xmx31g -Xms31g
- max memory 38Gi
- 3x Data nodes
- roles: master,data,data_content,data_hot
- java options: -Xmx31g -Xms31g
- max memory 64Gi
The main index we're working with has 100 primary shards with 1 replica, consisting of 618 GB of primary (unreplicated) data, so 1.2 TB in total. There are 1,143,317 documents, some large (~100 MB of text), some small (<1 KB of text). Since we have to write data frequently, we've split each document's text out into a child document, linked to its parent with a "join" field.
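For reference, the mapping looks roughly like this (simplified; the field and relation names "doc_join", "metadata", and "text" here are illustrative, not our exact names):

```json
PUT /index_3
{
  "mappings": {
    "properties": {
      "doc_join": {
        "type": "join",
        "relations": { "metadata": "text" }
      }
    }
  }
}
```

Child "text" documents are indexed with `routing` set to the parent's ID so they land on the same shard as their parent.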
Output of _cat/indices:
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open index_3 uItMRfInQGSI7jKuzltKZw 100 1 4040608 1640051 1.2tb 618gb
When writing data to existing documents, we see consistent memory pressure. Occasionally a data node runs out of memory and crashes, producing a heap dump file.
We analyzed this dump with Eclipse MAT and found that over 95% of the heap was retained by Lucene's "ConcurrentMergeScheduler$MergeThread":
47 instances of "org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread", loaded by "jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x10106b188" occupy 22,880,507,664 (96.35%) bytes.
Looking at the largest objects, we see many byte[] arrays holding document data, some of it "text" documents and some "metadata" documents (see screenshot below). Document IDs suffixed with "_text" are the child text documents.
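Since the retained heap is dominated by merge threads, one thing we've been looking at (but haven't tried yet) is capping merge concurrency per index; a sketch of the relevant dynamic index settings, with illustrative values:

```json
PUT /index_3/_settings
{
  "index.merge.scheduler.max_thread_count": 1,
  "index.merge.scheduler.max_merge_count": 6
}
```

We'd appreciate guidance on whether constraining the merge scheduler like this is sensible here, or whether it just masks the underlying problem.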
When we write to a parent document that has a join field, does Elasticsearch have to rewrite the child documents as well? Any other insights would be greatly appreciated.