Here's my thought. I've got servers with 256GB RAM. The optimal heap size for an ES JVM is 32GB (or just under that). How cool would it be if there was something like a "data-shared data node" (TM) that could run on the same server as my "regular" data node, but NOT need its own copy of the shard data on disk? It would instead refer to the data already there from the "regular" data node.
This data-shared data node could use its entire heap for queries alone, and query off the data that is "owned" by the "regular" data node. I could run 1 regular and 2 shared JVMs per server, and dramatically increase my query potential without having to create new copies of the data!
The optimal heap size for an ES JVM is 64GB (or just under that)
Almost: 64 GB is a good amount of RAM if you follow the recommendation of giving roughly 50% of it to the JVM. The heap itself should be kept below ~30.5 GB so the JVM can keep using compressed object pointers (compressed oops).
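For reference, here's a minimal sketch of what that sizing looks like in practice. The file path assumes a recent Elasticsearch layout (`config/jvm.options`); older releases set the same thing through the `ES_HEAP_SIZE` environment variable instead, so treat the exact location as an assumption for your version.

```
# config/jvm.options (assumed path; older releases use ES_HEAP_SIZE instead)
# Pin min and max heap to the same value, below the ~30.5 GB compressed-oops cutoff
-Xms30g
-Xmx30g
```

You can confirm the JVM actually kept compressed oops at a given heap size with `java -Xmx30g -XX:+PrintFlagsFinal -version | grep UseCompressedOops`.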
This sounds a lot like the shadow replica functionality, although that requires the use of a shared file system for the entire cluster, not just a single server.
Similar for sure. Seems like searching a shared cluster-wide file system would be a bit on the slow side. Multiple JVMs searching the same local disk data could be super fast.
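For context, shadow replicas were wired up roughly like this. This is a sketch from memory of the 1.x/2.x-era settings (the feature has since been removed), so treat the exact key names and paths as assumptions:

```
# elasticsearch.yml on each node (assumed key name): where the shared filesystem is mounted
path.shared_data: /mnt/shared/es

# then at index-creation time:
PUT /my_index
{
  "index": {
    "data_path": "/mnt/shared/es/my_index",
    "shadow_replicas": true
  }
}
```

The point stands either way: the design assumed one shared filesystem for the whole cluster, not multiple JVMs reading the same local disk.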
Considering the data stored on disk is not quite used as-is, the only real benefit you would see is lower disk space utilization, which is not the bottleneck.
Thanks to mmap/doc values, more than just the JVM heap is used to store process data. Perhaps if you found a way to share the doc values data, then you would have something.
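It's worth making the mmap point concrete: file-backed pages live in the OS page cache and are shared across processes, so the mmap'd (off-heap) portion of the index data is effectively already shared between JVMs on the same box. A minimal Python sketch of that behavior; the file name is a hypothetical stand-in for a Lucene doc-values file:

```python
import mmap
import os
import tempfile


def pages_match(path: str) -> bool:
    """Map the same file twice (a stand-in for two JVMs mapping one segment file).

    Both read-only mappings are backed by the same physical pages in the
    OS page cache, which is why mmap'd index data is already shared
    between processes on one machine.
    """
    with open(path, "rb") as f1, open(path, "rb") as f2:
        m1 = mmap.mmap(f1.fileno(), 0, access=mmap.ACCESS_READ)
        m2 = mmap.mmap(f2.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            return m1[:16] == m2[:16]
        finally:
            m1.close()
            m2.close()


# A throwaway file standing in for a doc-values file (hypothetical suffix).
with tempfile.NamedTemporaryFile(suffix=".dvd", delete=False) as tmp:
    tmp.write(b"\x01\x02\x03\x04" * 1024)
    path = tmp.name

print(pages_match(path))  # prints True
os.remove(path)
```

Which is why the heap, not the on-disk or page-cached data, is the part you'd actually be multiplying by running extra JVMs.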