Elasticsearch design - room for improvement

We are starting to design a cluster and have come up with the following optimal configuration. Please suggest whether there is any scope for improvement, or a way to save some budget if it is over-optimized.

  1. 100 fields at 1 MB per field (including the inverted index), so 125 MB per document to be on the safe side. In total, 4M documents correspond to 500 GB. At 25 GB per shard that is 20 primary shards, or 40 shards in total including replicas (r = 1). We had seen somewhere that a shard size of 25 GB works well in most scenarios.
  2. It also seems that a maximum heap of 32 GB of RAM per shard works well (the JVM uses compressed pointers), which translates to 64 GB per shard, with the remaining 50% left for the FS cache. So, considering 256 GB of RAM, this translates to 2 shards per machine (128 GB), which gives 20 data nodes per cluster (2 shards per data node), 3 master nodes (for HA), and 1 coordinating node.
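
The shard and node arithmetic in the two points above can be sketched as follows. This is only a rough check, assuming the 500 GB total is taken as given rather than re-derived from the per-document figures; the function name `plan_cluster` and its parameter names are made up for illustration and are not part of any Elasticsearch API.

```python
import math


def plan_cluster(total_primary_gb, target_shard_gb=25, replicas=1,
                 shards_per_data_node=2, master_nodes=3, coordinating_nodes=1):
    """Hypothetical helper: derive shard and node counts from the figures in the post."""
    # Primary shards needed to keep each shard at or below the target size.
    primary_shards = math.ceil(total_primary_gb / target_shard_gb)
    # Every primary shard has `replicas` additional copies.
    total_shards = primary_shards * (1 + replicas)
    # Data nodes needed when each node hosts a fixed number of shards.
    data_nodes = math.ceil(total_shards / shards_per_data_node)
    return {
        "primary_shards": primary_shards,
        "total_shards": total_shards,
        "data_nodes": data_nodes,
        "total_nodes": data_nodes + master_nodes + coordinating_nodes,
    }


# Figures from the post: 500 GB of primary data, 25 GB shards, r = 1.
plan = plan_cluster(500)
print(plan)
```

With the numbers from the post this gives 20 primary shards, 40 shards in total, 20 data nodes, and 24 nodes overall. Changing `target_shard_gb` or `shards_per_data_node` shows how sensitive the node count is to those two assumptions.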

Please add your recommendations.

Elasticsearch is generally not optimized for handling documents that large. Why are your documents that large? What do they contain? What is the use case?

Do you consider a document of 1 MB to be large, given that it has around 100 fields? And is ES not optimized for these cases?

I would consider documents over a few MB in size to be quite large, and you mentioned them potentially being tens of MB in size, which I suspect could be problematic and difficult to work with.

This does sound like quite an unusual use case, so I would recommend you test and benchmark it to find out how well it works.

Sorry, I meant 100 MB per document (not per field).

I was talking about document size, not field size.

What kind of data is this? What is the use case? How are you intending to search and use the results?