So we have a situation where we want to build a mini search engine for our project; the total volume of data is 5 PB. We also want to keep a backup copy of everything, so the total capacity needed is approximately 10 PB. These are some of the specs we have in mind:
Each node will have 4 TB of SSD storage and 64 GB of RAM (which works out to around 2,500 nodes to accommodate the 10 PB in total; see the rough sizing sketch after this list).
There is no hot or cold data; anything can be searched at any time. For example, Google does not return results for "Donald Trump news" any faster than for "Richard Feynman"; both take around 0.5 seconds to show the top 10 results, even though the first one also comes up with tiles of the latest news about Donald Trump. Something like that.
The SLA is like a normal search engine (Google/Bing/DuckDuckGo): 0.5 to 1 second. The query can be anything, from "Britney Spears meltdown" to "history of KKK" to "1095 crusade", so all of the data has to be searched for the top results on every query.
We want shards of approximately 50 GB, evenly distributed among the 2,500 nodes, with each shard having one replica stored on a different node.
We can assume user traffic of, say, a few hundred to a few thousand at the onset, and then the specs should be flexible enough to accommodate more traffic later (maybe a higher-bandwidth NIC).
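As a rough back-of-the-envelope check of those numbers (plain Python; all of the inputs are our own assumptions, not measured values):

```python
# Back-of-the-envelope sizing from the specs above (assumptions, not measurements).
TOTAL_DATA_TB = 5_000      # 5 PB of primary data
REPLICAS = 1               # one extra copy of every shard
NODE_CAPACITY_TB = 4       # 4 TB SSD per node
SHARD_SIZE_GB = 50         # target shard size

total_on_disk_tb = TOTAL_DATA_TB * (1 + REPLICAS)            # ~10 PB on disk
nodes_needed = total_on_disk_tb / NODE_CAPACITY_TB           # ~2,500 nodes
primary_shards = TOTAL_DATA_TB * 1_000 / SHARD_SIZE_GB       # ~100,000 primaries
shard_copies = primary_shards * (1 + REPLICAS)               # ~200,000 shard copies
shards_per_node = shard_copies / nodes_needed                # ~80 shards per node

print(f"nodes: {nodes_needed:.0f}, shard copies: {shard_copies:.0f}, "
      f"shards per node: {shards_per_node:.0f}")
```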
Question - are the 64 GB RAM / 4 TB SSD specs necessary for this SLA, or do we need more? Or would something less work as well?
Secondly, we gave the specs to our datacenter team, and they are asking for some more details before they can provide an approximate quote, so can you suggest some typical answers?
We do have some users with multi-PB clusters, and with 100k+ shards, but they typically have way fewer than 2,500 nodes (even 250 nodes is a lot). I don't think there's any one-size-fits-all advice for a cluster of this size though; you will need to invest in some research and experimentation for your specific use case. Bear in mind that this is a free community forum - we do our best, but speccing out a multi-million-dollar system from scratch is some distance above our pay grade.
Haha, thanks. I was not asking others to spec out everything, just hoping that users could weigh in with some typical values they have used in their own projects for the specs our vendors are asking about, like bandwidth (how much makes sense for our use case - 20 Gbps?), number of public IP addresses (10? 20? 50? all behind a load balancer?), etc.
Just curious - the standard template seems to be a node size of 3-4 TB with 64 GB of RAM, as far as I have seen. When you say a multi-PB cluster with 250 nodes, that's roughly 20 TB per node. Is that optimal - a 20 TB node with 64 GB of RAM?
There is no good way to answer your other questions other than to run your own experiments on some realistic data and workloads. Elasticsearch doesn't need any particular amount of bandwidth; that question in particular is completely determined by the quantity and types of queries you need to run.
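To illustrate the kind of experiment I mean: for anything serious you would use a proper benchmarking tool (Rally, for instance) against a cluster loaded with realistic data, but even a crude timing loop shows the shape of it. Here's a minimal sketch with the Python client, assuming an 8.x `elasticsearch` client, a test cluster at localhost:9200, and a hypothetical index `search-bench` with a `body` text field:

```python
import time
from statistics import median, quantiles

from elasticsearch import Elasticsearch  # pip install elasticsearch

# Assumed test cluster address and index/field names - replace with your own.
es = Elasticsearch("http://localhost:9200")
INDEX = "search-bench"

QUERIES = ["Britney Spears meltdown", "history of KKK", "1095 crusade"]

latencies = []
for q in QUERIES * 50:  # repeat each query to get a stable sample
    start = time.perf_counter()
    es.search(index=INDEX, query={"match": {"body": q}}, size=10)
    latencies.append(time.perf_counter() - start)

print(f"median: {median(latencies) * 1000:.0f} ms, "
      f"p95: {quantiles(latencies, n=20)[18] * 1000:.0f} ms")
```

Run something like that against increasingly realistic data volumes and query mixes and you'll learn far more about your RAM/SSD/bandwidth needs than any generic rule of thumb.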