We have a situation where we want to build a mini search engine for our project; the total volume of the data is 5 PB. We also want to keep a backup copy, so the total capacity needed is approximately 10 PB. These are the specs we have in mind:
- Each node will be a 4 TB SSD box with 64 GB of RAM (implying around 2,500 nodes to accommodate the 10 PB in total; see the sizing sketch after this list).
- There is no hot/cold split in the data; any document can be searched at any time. For example, Google does not return results for "Donald Trump news" any faster than for "Richard Feynman": both take around 0.5 seconds to show the top 10 results, even though the first one also surfaces tiles of the latest news about Donald Trump. We want something like that.
- The SLA is like that of a normal search engine (Google/Bing/DuckDuckGo): 0.5 to 1 second. The query can be anything, from "Britney Spears meltdown" to "history of the KKK" to "1095 crusade", so all of the data has to be searched for top results on every query.
- We want shards of roughly 50 GB each, distributed evenly across the 2,500 nodes, with each shard having one replica stored on a different node.
- We can assume user traffic of, say, a few hundred to a few thousand queries at the onset, and the specs should be flexible enough to accommodate more traffic later (perhaps a higher-bandwidth NIC).
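To sanity-check these numbers, here is a quick back-of-envelope sketch (plain Python; using decimal PB/TB/GB units is an assumption on my part, and all other figures are just the ones stated above):

```python
# Back-of-envelope sizing from the figures above (decimal units assumed).
PB = 1000 ** 5
TB = 1000 ** 4
GB = 1000 ** 3

primary_data = 5 * PB                    # raw corpus
total_data = 2 * primary_data            # primary + one replica of everything = 10 PB

node_disk = 4 * TB
node_ram = 64 * GB
shard_size = 50 * GB

nodes = total_data // node_disk          # 2,500 nodes
total_shards = total_data // shard_size  # 200,000 shards (primaries + replicas)
shards_per_node = total_shards // nodes  # 80 shards per node
ram_to_disk = node_ram / node_disk       # ~1.6% of a node's data fits in RAM

print(f"nodes needed:     {nodes:,}")
print(f"total shards:     {total_shards:,}")
print(f"shards per node:  {shards_per_node}")
print(f"RAM : disk ratio: {ram_to_disk:.1%}")
```

Note that 80 shards of 50 GB fills a 4 TB node exactly, leaving no headroom for the OS, index merges, or growth; that alone may argue for either larger disks or more nodes.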
Question: are the 64 GB RAM / 4 TB SSD specs enough for this SLA, or do we need more? Or would something smaller work just as well?
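For context on that question, here is the kind of rough arithmetic that drives the answer. Both the cluster-wide QPS and the per-shard search time below are purely assumed illustration values, not numbers from our spec: since there is no hot/cold split, every query fans out to shards on every node, so per-node load scales with total QPS.

```python
# Hypothetical fan-out estimate. The QPS and per-shard search time are
# assumed illustration values, not measurements or spec numbers.
nodes = 2500
shards_per_node = 80            # from the sizing sketch above
assumed_qps = 1000              # assumed cluster-wide queries per second
assumed_shard_search_s = 0.05   # assumed wall-clock time to search one 50 GB shard

# With no hot/cold split, every query must consult every shard,
# so every node participates in every query.
shard_searches_per_node_per_s = assumed_qps * shards_per_node

# Cores a node would need if each shard search occupies one core
# for assumed_shard_search_s seconds (a crude upper bound).
cores_needed = shard_searches_per_node_per_s * assumed_shard_search_s

print(f"shard searches per node per second: {shard_searches_per_node_per_s:,}")
print(f"rough cores needed per node:        {cores_needed:,.0f}")
```

Under these assumed numbers the per-node compute requirement looks implausible, which suggests that either far fewer shards can be searched per query, per-shard search must be much faster (more of the index resident in RAM), or many more nodes are needed; this is the trade-off we want advice on.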
Secondly, we gave these specs to our datacenter guys, and they are asking for some more details before they can provide an approximate quote, so can you suggest some answers here: