Hello All,
I need your help validating a few numbers, based on your experience.
I would be happy to get a response or some feedback.
I want to check whether Elasticsearch can feasibly handle the numbers below:
100k documents per second
8.5 billion+ documents per day
Kafka topics will feed the data, with a maximum delay of 3 minutes between data generation and consumption by Elasticsearch.
3 primary/1 replica shard setup
Queries: 2x the number of records inserted per second, so 200k or 300k per second on average.
Based on the above figures, how large would the data be per day? I assumed it would be around 2000 GB, but is it more or less?
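As a quick sanity check on the figures above, the daily document count and the average document size implied by a 2000 GB/day estimate can be worked out directly. This is only a sketch: the 2000 GB figure is the assumption from the question, a GB is taken as 10^9 bytes, and the result says nothing about the actual on-disk index size, which depends on mappings and compression.

```python
# Back-of-the-envelope check of the ingest figures quoted above.
DOCS_PER_SECOND = 100_000
SECONDS_PER_DAY = 24 * 60 * 60
ASSUMED_DAILY_GB = 2000   # the estimate from the question, not a measurement
BYTES_PER_GB = 10**9      # assumption: decimal gigabytes

docs_per_day = DOCS_PER_SECOND * SECONDS_PER_DAY
implied_bytes_per_doc = ASSUMED_DAILY_GB * BYTES_PER_GB / docs_per_day

print(f"documents per day: {docs_per_day:,}")
print(f"implied average size per document: {implied_bytes_per_doc:.0f} bytes")
```

If the real documents average noticeably more than this implied size once indexed, the daily volume would be correspondingly higher than 2000 GB.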
I used the Elasticsearch sizing guides and arrived at the numbers below:
2000 GB * 7 days * (1 + 1) = 28000 GB -> 28000 GB * 1.25 = 35000 GB -> 35000 GB / 64 GB / 30 + 1 = 19 hot nodes
2000 GB * 180 days * (1 + 1) = 720000 GB -> 720000 GB * 1.25 = 900000 GB -> 900000 GB / 64 GB / 160 + 1 = 88 warm nodes
(with the ratios increased to 50 and 250)
2000 GB * 7 days * (1 + 1) = 28000 GB -> 28000 GB * 1.25 = 35000 GB -> 35000 GB / 64 GB / 50 + 1 = 11 hot nodes
2000 GB * 180 days * (1 + 1) = 720000 GB -> 720000 GB * 1.25 = 900000 GB -> 900000 GB / 64 GB / 250 + 1 = 57 warm nodes
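The calculation above can be reproduced with a short script, which makes it easier to try other ratios. This is a minimal sketch, assuming the division is rounded up before the extra failover node is added; the function names are hypothetical. Under that rounding assumption each result comes out one node higher than the figures above, which appear to truncate instead.

```python
import math

def total_storage_gb(raw_gb_per_day, days, replicas, overhead=0.25):
    """Raw data * retention * (replicas + 1), plus 25% headroom as in the post."""
    total_data = raw_gb_per_day * days * (replicas + 1)
    return total_data * (1 + overhead)

def data_nodes(storage_gb, ram_gb_per_node, memory_data_ratio):
    """Nodes needed for the storage, rounded up (assumption), plus one for failover."""
    return math.ceil(storage_gb / ram_gb_per_node / memory_data_ratio) + 1

hot_storage = total_storage_gb(2000, days=7, replicas=1)     # 35000 GB
warm_storage = total_storage_gb(2000, days=180, replicas=1)  # 900000 GB

scenarios = [
    ("hot", hot_storage, 30),
    ("warm", warm_storage, 160),
    ("hot", hot_storage, 50),
    ("warm", warm_storage, 250),
]
for tier, storage, ratio in scenarios:
    nodes = data_nodes(storage, ram_gb_per_node=64, memory_data_ratio=ratio)
    print(f"{tier} tier, ratio 1:{ratio}: {storage:,.0f} GB -> {nodes} nodes")
```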
These numbers look too high, but if they are correct, I have decided on the hardware below:
CPU – minimum 32 cores per machine
RAM – 64 GB
11 hot nodes: Disk – 3 TB SSD per node
57 warm nodes: Disk – 16 TB SSD per node
5 master nodes (1.9 TB/2 TB of storage is not required on these nodes)
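One way to cross-check the hardware list is to compare the raw disk in each tier with the storage totals from the estimate. The sketch below assumes decimal units (1 TB = 1000 GB) and that every node in a tier contributes its full disk; the helper name is hypothetical.

```python
# Compare raw disk per tier with the storage totals from the estimate above.
GB_PER_TB = 1000  # assumption: decimal units

tiers = {
    "hot":  {"nodes": 11, "disk_tb_per_node": 3,  "required_gb": 35_000},
    "warm": {"nodes": 57, "disk_tb_per_node": 16, "required_gb": 900_000},
}

def tier_capacity_gb(nodes, disk_tb_per_node):
    """Total raw disk in a tier, in GB."""
    return nodes * disk_tb_per_node * GB_PER_TB

for name, tier in tiers.items():
    capacity = tier_capacity_gb(tier["nodes"], tier["disk_tb_per_node"])
    margin = capacity - tier["required_gb"]
    print(f"{name}: capacity {capacity:,} GB, "
          f"required {tier['required_gb']:,} GB, margin {margin:+,} GB")
```

With these inputs the hot tier comes out about 2000 GB short of the 35000 GB estimate, while the warm tier has a small surplus, so the hot-node disk size or node count may deserve a second look.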
- Does the above Elasticsearch design make sense? Unfortunately, I am unable to set up a small instance and test it, so I have to rely on estimates only.
- Can we use compression for data older than 7 days? How do we do it? I do not want the current data to be compressed. According to online guides, this would result in 10-20% storage savings.
- Suggestion: The 32 GB heap limit is too restrictive. Such cases need a 64-bit Java process. Is the limit still in place, or is a 64-bit version, or support for multiple parallel 32 GB processes, in the pipeline?
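On the compression question above, one possible approach is an index lifecycle management (ILM) policy whose warm phase force-merges indices older than 7 days with the `best_compression` codec, leaving newer indices on the default codec. The sketch below only builds the request body that would be sent to `PUT _ilm/policy/<name>`; the policy name is hypothetical, the single-segment merge is an assumption, and nothing here has been tested against a cluster.

```python
import json

# Hypothetical policy name; the body would be sent to PUT _ilm/policy/<name>.
POLICY_NAME = "compress-after-7d"

policy = {
    "policy": {
        "phases": {
            "warm": {
                # Indices enter this phase 7 days after creation (or after
                # rollover, if the index is managed with rollover).
                "min_age": "7d",
                "actions": {
                    "forcemerge": {
                        "max_num_segments": 1,              # assumption
                        "index_codec": "best_compression",  # recompress on merge
                    }
                },
            }
        }
    }
}

body = json.dumps(policy, indent=2)
print(f"PUT _ilm/policy/{POLICY_NAME}")
print(body)
```

Because the codec change only takes effect as segments are rewritten, tying it to the force merge keeps the current, actively written indices uncompressed.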
Let me know if any further information is needed.