We are starting to design a cluster and have come up with the following configuration. Please suggest whether there is any scope for improvement, or whether we can save some budget if it is over-provisioned.
Our sizing assumptions:

- 100 fields at roughly 1 MB per field (including the inverted index), so about 125 MB per document to be on the safe side.
- 4M documents in total, corresponding to 500 GB of data.
- At 25 GB per shard that gives 20 primary shards, i.e. 40 shards in total with one replica (r = 1). We had seen somewhere that a shard size of around 25 GB works well in most scenarios (sketched below).
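For reference, here is that shard arithmetic as a minimal Python sketch (the 500 GB total, the 25 GB target shard size, and r = 1 are the figures quoted above, not measured values):

```python
# Shard-count check based on the figures quoted above.
total_data_gb = 500            # estimated primary data volume
target_shard_size_gb = 25      # commonly cited shard size that "works well"
replicas = 1                   # r = 1

primary_shards = total_data_gb // target_shard_size_gb    # 20
total_shards = primary_shards * (1 + replicas)             # 40 including replicas

print(f"primary shards: {primary_shards}, total shards with replicas: {total_shards}")
```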
It also seems that a maximum heap of 32 GB (so the JVM can use compressed object pointers) works well, which translates to a budget of 64 GB of RAM per shard (the other 50% left for the filesystem cache). On machines with 256 GB of RAM that means 2 shards per machine (128 GB), which translates to 20 data nodes for the cluster (2 shards per data node), plus 3 master nodes (for HA) and 1 coordinating node, as sketched below.
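Again as a minimal sketch of that node arithmetic (the 64 GB-per-shard RAM budget, the 256 GB machines, and the dedicated master/coordinating counts are this post's own assumptions, not recommendations):

```python
# Node-count sketch based on the RAM budgeting described above.
heap_gb = 32                      # max heap so the JVM keeps compressed object pointers
fs_cache_gb = 32                  # matching amount left for the filesystem cache
ram_per_shard_gb = heap_gb + fs_cache_gb        # 64 GB budgeted per shard

machine_ram_gb = 256
shards_per_data_node = 2          # 2 * 64 GB = 128 GB used per machine
headroom_gb = machine_ram_gb - shards_per_data_node * ram_per_shard_gb   # 128 GB spare

total_shards = 40                 # primaries + replicas from the shard math above
data_nodes = total_shards // shards_per_data_node    # 20
master_nodes = 3                  # dedicated masters for HA
coordinating_nodes = 1

print(f"data nodes: {data_nodes}, "
      f"total nodes: {data_nodes + master_nodes + coordinating_nodes}, "
      f"unused RAM per machine: {headroom_gb} GB")
```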
Elasticsearch is generally not optimized for handling documents that large. Why are your documents that large? What do they contain? What is the use case?
I would consider documents over a few MB in size to be quite large, and you mention them potentially being tens of MB in size, which I suspect could be problematic and difficult to work with.
This does sound like quite an unusual use case, so I would recommend you test and benchmark it to find out how well it works.
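As a starting point for such a test, a minimal sketch along the following lines could show how a cluster copes with documents of this size. It assumes the official elasticsearch-py client (8.x-style API) against a local test cluster; the index name, field count, and field size are placeholders based on the numbers in the original post:

```python
import random
import string
import time

from elasticsearch import Elasticsearch  # official Python client, 8.x-style API assumed

es = Elasticsearch("http://localhost:9200")   # placeholder endpoint for a test cluster

INDEX = "large-doc-test"              # hypothetical index name
FIELDS = 100                          # field count from the original post
FIELD_SIZE_BYTES = 1 * 1024 * 1024    # ~1 MB of text per field, per the post's estimate


def random_text(size: int) -> str:
    """Return `size` characters of random ASCII text to stand in for real field content."""
    return "".join(random.choices(string.ascii_lowercase + " ", k=size))


# One synthetic document with 100 ~1 MB text fields (~100 MB in total).
# Note: documents this size can exceed Elasticsearch's default http.max_content_length
# (100 MB) and be rejected outright, which is itself a useful data point for this test.
doc = {f"field_{i}": random_text(FIELD_SIZE_BYTES) for i in range(FIELDS)}

# Fresh single-shard index so timings are not skewed by replication.
if es.indices.exists(index=INDEX):
    es.indices.delete(index=INDEX)
es.indices.create(index=INDEX, settings={"number_of_shards": 1, "number_of_replicas": 0})

start = time.perf_counter()
es.index(index=INDEX, document=doc)
doc_mb = FIELDS * FIELD_SIZE_BYTES // (1024 * 1024)
print(f"indexed one ~{doc_mb} MB document in {time.perf_counter() - start:.2f}s")
```

Extending this to bulk indexing and a few representative queries, or driving the test with a benchmarking tool such as Rally, would give a better picture of sustained indexing and search behaviour.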