I am trying to architect a cluster optimized for search speed. Right now we have three dedicated data nodes on bare metal, each with 64 cores, 755GB RAM, and 15TB of SSD storage, along with 3 dedicated master nodes on VMs (4 cores, 32GB RAM each). We index ~1600GB of data into a single index with 6 shards and 1 replica (~130GB per shard). We are willing to compromise indexing speed to maximize search speed.
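For reference, the current layout corresponds to index settings roughly like the following (the index name `our_index` is just a placeholder, not our real index):

```
PUT /our_index
{
  "settings": {
    "index": {
      "number_of_shards": 6,
      "number_of_replicas": 1
    }
  }
}
```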
We are currently seeing response times up to 5000ms at the 99th percentile on our queries. We are wondering if there are any immediate red flags in our architecture. Please let me know if more information is required to give proper advice.
Would it be best to eliminate the dedicated master nodes on VMs and just run the 3 bare-metal servers as combined master/data nodes? A sketch of what that would look like is below.
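Assuming a version with the `node.roles` setting (7.9+); older versions would use the `node.master` / `node.data` booleans instead:

```
# elasticsearch.yml on each of the three bare-metal servers
# (combined master-eligible + data node)
node.roles: [ master, data ]

# pre-7.9 equivalent:
# node.master: true
# node.data: true
```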
How can we maximize CPU and memory usage on these massive data nodes? Per the Elasticsearch documentation, we are only allocating 31GB of heap to Elasticsearch. Our telemetry indicates that the data nodes are not coming close to being maxed out on CPU, memory, or I/O.
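For completeness, the 31GB figure is set in `jvm.options`; the values here are just what we are currently running, not a recommendation:

```
# config/jvm.options
# min and max heap set to the same value, per the docs
-Xms31g
-Xmx31g
```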
We are indexing a custom data model. We re-indexed our data using 30 shards (~20GB per shard). This gave us a very small performance boost.
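The re-sharding was done roughly like this (index names are placeholders): create a new index with 30 primaries, then copy the data over with the `_reindex` API.

```
PUT /our_index_30
{
  "settings": {
    "index": {
      "number_of_shards": 30,
      "number_of_replicas": 1
    }
  }
}

POST /_reindex
{
  "source": { "index": "our_index" },
  "dest":   { "index": "our_index_30" }
}
```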
So far, our biggest performance boost came from decreasing the heap enough to enable compressed ordinary object pointers (26GB in this case). We have found that decreasing the heap further to 16GB seems to yield the best results, though performance is still not where we would like it to be.
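In case anyone wants to reproduce this: we verified that compressed oops were actually in use via the nodes info API (if I remember the response field correctly, it is `using_compressed_ordinary_object_pointers` under `jvm`; the heap-size line Elasticsearch logs at startup shows the same thing).

```
GET /_nodes/jvm?filter_path=nodes.*.jvm.using_compressed_ordinary_object_pointers
```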
We are now exploring G1GC rather than CMS. I will report back with our findings, though we are still open to any other pointers that people may have.
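What we are testing is essentially swapping the default CMS flags in `jvm.options` for G1 ones, roughly as below; these mirror what newer Elasticsearch versions ship as defaults, and the exact values may still need tuning for our heap sizes:

```
# config/jvm.options
# remove / comment out the CMS defaults:
# -XX:+UseConcMarkSweepGC
# -XX:CMSInitiatingOccupancyFraction=75
# -XX:+UseCMSInitiatingOccupancyOnly

# enable G1GC instead:
-XX:+UseG1GC
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30
```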