I want to set up a logging cluster with Elasticsearch. The ingestion rate should be 2,000 events per second, i.e. roughly 160 GB per day (hypothesis: 1 log event is 1 KB). The retention time is 365 days, of which 15 days on the hot tier.
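As a quick sanity check (a sketch, assuming 1 KB = 1,000 bytes), the stated rate actually works out slightly above the 160 GB round figure:

```python
# Daily ingest volume for 2000 events/s at 1 KB per event.
eps = 2000
bytes_per_event = 1_000   # assumption: 1 KB = 1000 bytes
seconds_per_day = 86_400

daily_gb = eps * bytes_per_event * seconds_per_day / 1e9
print(daily_gb)  # ~172.8 GB/day, a bit above the 160 GB estimate
```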
I made the following hypotheses, but I'm not sure they're right:
- 1 index for all logs;
- nearly 650 shards: the whole data set (without replicas) is 160 GB × 365 days, with a 1.10 indexing overhead and a target of 100 GB per shard.
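The 650-shard figure can be reproduced from those hypotheses (a rough sketch; these are primary shards only, replicas excluded):

```python
# Primary shard count from: daily volume x retention x indexing overhead,
# divided by the 100 GB target shard size.
import math

daily_gb = 160
retention_days = 365
overhead = 1.10
shard_size_gb = 100

primary_data_gb = daily_gb * retention_days * overhead  # ~64,240 GB
shards = math.ceil(primary_data_gb / shard_size_gb)     # ~643, i.e. "nearly 650"
print(primary_data_gb, shards)
```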
The whole data amount will be roughly 128 TB (with 1 replica), and I count a total of 26 warm nodes and 1 hot node (with 5 TB of storage per node). I chose 5 TB per node according to the disk-to-RAM ratios of roughly 40:1 for hot nodes and 100:1 for warm nodes. The RAM for hot nodes will be 128 GB and for warm nodes 64 GB. That makes nearly 24 primary shards per node.
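The node counts can be sketched the same way (assumptions: 1 replica, the hot tier holds the most recent 15 days, the warm tier holds the rest; under these numbers the 15-day hot data slightly overflows a single 5 TB node):

```python
# Node counts per tier from daily volume, retention split, overhead and replicas.
import math

daily_gb = 160
overhead = 1.10
copies = 2            # 1 primary + 1 replica
hot_days, total_days = 15, 365
node_disk_gb = 5_000  # 5 TB per node

total_gb = daily_gb * total_days * overhead * copies  # ~128,480 GB (~128 TB)
hot_gb = daily_gb * hot_days * overhead * copies      # ~5,280 GB
warm_gb = total_gb - hot_gb                           # ~123,200 GB

hot_nodes = math.ceil(hot_gb / node_disk_gb)    # 2: 5.28 TB just exceeds one 5 TB node
warm_nodes = math.ceil(warm_gb / node_disk_gb)  # 25
print(hot_nodes, warm_nodes)
```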
To summarize, my architecture for 2,000 EPS, 1-year retention and 15 days of hot logs is composed of:
- 1 hot node with 128 GB RAM and 5 TB disk;
- 26 warm nodes with 64 GB RAM and 5 TB disk each.
(I am not counting the master nodes here.)
Is it correct, or oversized?
Is it possible to compress the data? If I compress the data on the warm nodes and keep only 20% of the original volume, the architecture would only need 5 warm nodes. Correct?
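For what it's worth, the 5-node figure does follow arithmetically from the 20% assumption, but the assumption itself looks optimistic: Elasticsearch already compresses stored fields by default (LZ4), and switching the warm indices to the `best_compression` codec typically saves noticeably less than 80% on top of that. A sketch of the calculation:

```python
# Warm-tier node count if compressed data were 20% of the original volume.
# The 0.20 factor is the question's hypothesis, not a measured ratio.
import math

warm_gb = 160 * (365 - 15) * 1.10 * 2  # warm data with 1 replica, ~123,200 GB
compressed_gb = warm_gb * 0.20         # ~24,640 GB under the 20% hypothesis
warm_nodes = math.ceil(compressed_gb / 5_000)
print(warm_nodes)  # 5 nodes at 5 TB each
```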
Thanks a lot,