I know I'm asking a rather generic question that can't be answered without understanding my specific use case. However, I'm looking for some lower bounds. What is the minimum number of nodes, and the minimum size of each node (in terms of CPUs and RAM), that would be required for N documents distributed across S TB of data with K fields?
I'm not expecting a purely mathematical answer, but I want to know whether any prior research has been done to establish at least some correlation.
There is no such formula, as the answer depends on a large number of factors: indexing load, retention period, type and number of concurrent queries, query latency requirements, type of data, mappings, and the hardware used. All of these can have a major impact on the ideal size and composition of the cluster. So while there are simple guidelines based on disk-to-RAM ratios for different node types, your best bet is to run tests with your own data in order to get an accurate estimate.
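If it helps as a starting point before benchmarking, here is a minimal back-of-envelope sketch (Python) that derives a rough lower bound on data node count from a disk-to-RAM-ratio style guideline. The function name and every default value (replica count, on-disk expansion factor, disk-to-RAM ratio, per-node RAM) are assumptions for illustration only, not official guidance; replace them with numbers measured from your own indexing tests.

```python
import math

def estimate_data_nodes(raw_data_tb: float,
                        replicas: int = 1,               # assumed: one replica per primary
                        expansion_factor: float = 1.1,   # assumed: on-disk size vs. raw data
                        disk_to_ram_ratio: float = 30.0, # assumed: TB of disk served per TB of RAM
                        ram_per_node_gb: float = 64.0    # assumed: RAM per data node
                        ) -> int:
    """Rough lower bound on data nodes, driven purely by disk and RAM.

    Ignores CPU, query concurrency, latency targets, and mapping overhead,
    all of which can dominate in practice -- benchmark to confirm.
    """
    # Total on-disk footprint: raw data, indexing overhead, plus replicas.
    total_on_disk_tb = raw_data_tb * expansion_factor * (1 + replicas)
    # Disk a single node can comfortably serve under the assumed ratio.
    disk_per_node_tb = (ram_per_node_gb / 1024.0) * disk_to_ram_ratio
    return math.ceil(total_on_disk_tb / disk_per_node_tb)

if __name__ == "__main__":
    # Example: 10 TB of raw data with the default assumptions above.
    print(estimate_data_nodes(10.0))  # -> 12 data nodes under these assumptions
```

Treat the output only as a sanity check on the order of magnitude; the test-with-your-own-data step is what actually validates it.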