Elasticsearch space requirements for production

I am trying to find the most suitable configuration for my production server to deploy Elasticsearch, but after reading a lot of documentation I have not been able to figure out the exact hardware requirements for my purpose.

Below is my data volume:

Total documents: 20 billion
Total disk space: 50,000 GB (50 TB)

Initially I have the following Linux server configuration:

Processor: 4 cores
RAM: 16 GB
Disk space: 500 GB

CPU: Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz

My Questions:

What is the optimal disk space and number of shards per node?
How many servers will I need if I increase the disk space vertically?
What is the maximum disk space for a server with the above configuration, given that this data will be distributed across multiple servers?
How many shards should I create for this much data?
What should the shard size be?
How much data can be stored in a single shard (in GB)?
Any help is highly appreciated.
Thanks in advance.

It is almost impossible to give any advice based on the information provided, as any recommendation will depend heavily on what your use case looks like. What type of data are you indexing? How much data is indexed per day, and how long is data kept in the cluster? Do you index continuously or do updates in bulk? How is the cluster queried, e.g. what types of queries, and what are your query throughput and latency requirements?

Hi Christian,

Thanks for the reply. I am generating data from a SQL database and indexing it into Elasticsearch using the NEST client, in bulk batches of about 1,000 documents. Indexing is a one-time process for my requirements, though I may need to index more data in the future, say about 2,000 docs per day. The data will be kept in the cluster for a long time, as it is mostly category and state/city data.
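To illustrate the batching described above, here is a minimal sketch in Python (the post uses the NEST client in .NET; this is only an assumption-laden illustration of grouping rows into bulk batches of 1,000 actions for the Bulk API, with the index name "localdata" taken from the search output below):

```python
def to_bulk_actions(rows, index="localdata", batch_size=1000):
    """Yield lists of bulk-index actions, batch_size docs per batch."""
    batch = []
    for row in rows:
        batch.append({"_index": index, "_id": row["id"], "_source": row})
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

# Example with 2,500 hypothetical rows pulled from SQL:
rows = [{"id": str(i), "title": f"doc {i}"} for i in range(2500)]
batches = list(to_bulk_actions(rows))
print([len(b) for b in batches])  # [1000, 1000, 500]
```

Each yielded batch would then be sent in a single bulk request, keeping individual requests at a bounded size.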

My data structure is as follows:

"hits": {
"total": 70,
"max_score": 156.55983,
"hits": [
{
"_index": "localdata",
"_type": "doc",
"_id": "in.andamanonline.local:http/honda_car_b2c/in-2157",
"_score": 156.55983,
"_source": {
"meta_description": "Honda Car in andaman nicobar - local Honda Car business listings with contact address and phone numbers, get best deals from Honda Car in andaman nicobar",
"tstamp": "2015-10-08T12:39:00.655Z",
"digest": "7ae253b60a40aa6cd9ba45099ed4009f",
"host": "local.andamanonline.in",
"boost": "0.0",
"id": "in.andamanonline.local:http/honda_car_b2c/in-2157",
"title": "Honda Car in andaman nicobar, List of Honda Car in andaman nicobar - local.andamanonline.in",
"meta_keywords": "Honda Car in andaman nicobar",
"url": "http://local.andamanonline.in/honda_car_b2c/in-2157",
"content": "",
"h1tag": "Honda Car in andaman nicobar",
"category": "Car",
"subcategory": "Honda",
"servicename": "local"
}
},

I have indexed the fields meta_description, meta_keywords, title, url, h1tag, and content with edge n-gram, starts-with, shingle, and stemmed analyzers.
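For reference, index settings along those lines might look like the sketch below. This is not the poster's actual mapping; the analyzer/filter names, gram sizes, and the choice of title as the multi-field are all illustrative assumptions (and the mapping uses the current typeless format):

```python
# Illustrative index settings: edge n-gram, shingle, and stemmed analyzers
# applied to one field via multi-fields.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "edge_ngram_filter": {"type": "edge_ngram", "min_gram": 2, "max_gram": 20},
                "shingle_filter": {"type": "shingle", "min_shingle_size": 2, "max_shingle_size": 3},
                "stemmer_filter": {"type": "stemmer", "language": "english"},
            },
            "analyzer": {
                "edge_ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "edge_ngram_filter"],
                },
                "shingle_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "shingle_filter"],
                },
                "stemmed_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stemmer_filter"],
                },
            },
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "fields": {
                    "edge": {"type": "text", "analyzer": "edge_ngram_analyzer"},
                    "shingled": {"type": "text", "analyzer": "shingle_analyzer"},
                    "stemmed": {"type": "text", "analyzer": "stemmed_analyzer"},
                },
            }
        }
    },
}
```

Note that edge n-grams and shingles can inflate index size considerably relative to the raw data, which matters for the sizing questions in this thread.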

My queries involve filtering on state/city plus a match_phrase_prefix query.
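A sketch of that query shape, built as a plain request body (the field names "state" and "city", and searching against "title", are assumptions; only the state/city filter and match_phrase_prefix come from the post):

```python
def build_query(text, state=None, city=None):
    """Bool query: match_phrase_prefix scored clause + non-scoring filters."""
    filters = []
    if state:
        filters.append({"term": {"state": state}})
    if city:
        filters.append({"term": {"city": city}})
    return {
        "query": {
            "bool": {
                "must": [{"match_phrase_prefix": {"title": text}}],
                "filter": filters,
            }
        }
    }

q = build_query("honda car", state="andaman nicobar")
```

Putting state/city in the filter clause (rather than must) lets Elasticsearch cache those clauses and skip scoring them.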

If you need more details please let me know.
Thanks

The best setup will depend on how you organise your indices, the performance characteristics of the hardware used, the exact nature of your queries, the number of queries per second you will need to support, and your latency requirements. Unfortunately there is no formula or analytical method to determine this, so I would recommend running benchmarks on realistic hardware and data with a small number of nodes in order to determine the ideal shard size and how much data a node can handle given your use case and requirements.
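Benchmarking is the only reliable answer, but as a starting point for those experiments, a back-of-envelope calculation using the commonly cited rule of thumb of keeping shards roughly in the tens-of-GB range can bound the shard counts worth testing (the 50,000 GB figure is from the original post; the target sizes are just the range to sweep, not recommendations):

```python
total_gb = 50_000  # 50 TB of data from the original post

# Shards needed at each hypothetical target shard size (ceiling division).
estimates = {}
for target_shard_gb in (10, 30, 50):
    estimates[target_shard_gb] = -(-total_gb // target_shard_gb)

print(estimates)  # {10: 5000, 30: 1667, 50: 1000}
```

The actual on-disk size will differ from the raw data size (replicas, n-gram/shingle expansion, compression), which is another reason to measure a small representative index first and extrapolate.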