We are working on a PoC to index large volumes of web log files: approx. 200GB/day raw (400GB/day after indexing). Based on initial tests, we think we can dedicate appropriate hardware to the 'hot' nodes which index today's logs.
We would like to keep previous days' read-only indexes searchable for 1-2 months, moving these to lower-spec 'warm' nodes. Search volumes will be extremely low (a handful per hour), with many days' indexes not receiving any searches on a typical day. Unfortunately, we can't rely on snapshot/restore for these, because when a search is required it needs to be fulfilled quickly.
Some rough stats:
- Approx 12 daily log indexes totaling 400GB (largest index approx. 200GB).
- The shard count per index can be tweaked, but currently ranges from 5 up to 10 for the largest index (max 20GB shard size).
- We are using routing to route data (based on customer ID) to specific shards, and searches/aggregations will be limited to a single shard within an index.
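To illustrate the routing point, here is a minimal sketch of how such a single-shard search request might be built. The index name `weblogs-2016.01.15` and the field name `customer_id` are made up for the example and are not our real names; the `routing` query parameter is the standard Elasticsearch one, and the `bool`/`filter` query form assumes Elasticsearch 2.0 or later.

```python
import json
from urllib.parse import quote, urlencode


def routed_search_request(index, customer_id, query=None):
    """Build (path, body) for a search restricted to one customer's shard.

    `index` and the `customer_id` field are hypothetical names used only
    for illustration.
    """
    # The routing value sends the search to the single shard that holds
    # this customer's documents instead of fanning out to every shard.
    path = "/" + quote(index, safe="") + "/_search?" + urlencode(
        {"routing": customer_id}
    )
    # Several customers can hash to the same shard, so the query still
    # has to filter on the customer ID itself.
    bool_query = {"filter": [{"term": {"customer_id": customer_id}}]}
    if query is not None:
        bool_query["must"] = [query]
    body = {"query": {"bool": bool_query}}
    return path, json.dumps(body)


path, body = routed_search_request("weblogs-2016.01.15", "cust-42")
print(path)
print(body)
```

With routing like this, each search touches one shard of one index, which is why our concern is mostly the cost of keeping many idle shards open rather than search fan-out.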
To store a month's read-only logs (weekdays only), this gives us:
- Total storage: 9TB
- Total indexes: ~260
- Total shards: ~1,400!
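For anyone wanting to check or adjust these totals, the arithmetic can be sketched as below. The 22 weekdays per month and the shard split (11 indexes at 5 shards plus the largest at 10) are assumptions chosen to match the rough figures above, not exact measurements, and `per_node` with its node count is purely hypothetical.

```python
# Rough sizing for one month of read-only ("warm") indexes.
INDEXES_PER_DAY = 12        # from the stats above
INDEXED_GB_PER_DAY = 400    # from the stats above
WEEKDAYS_PER_MONTH = 22     # assumption: weekdays only
# Assumption: 11 indexes with 5 shards each, plus the largest with 10.
SHARDS_PER_DAY = 11 * 5 + 1 * 10


def warm_tier_totals(days=WEEKDAYS_PER_MONTH):
    """Return total indexes, shards and storage for `days` of retention."""
    return {
        "indexes": INDEXES_PER_DAY * days,
        "shards": SHARDS_PER_DAY * days,
        "storage_tb": INDEXED_GB_PER_DAY * days / 1000,
    }


def per_node(totals, node_count):
    """Spread the totals evenly over a hypothetical number of warm nodes."""
    return {
        "shards": totals["shards"] / node_count,
        "storage_tb": totals["storage_tb"] / node_count,
    }


totals = warm_tier_totals()
print(totals)
print(per_node(totals, node_count=3))  # node_count is illustrative only
```

Changing `days` to 44 gives the two-month case, which roughly doubles every figure.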
9TB of disk storage seems achievable over a small number of nodes, but we are concerned about the node memory needed to perform occasional searches/aggregations, and about how many nodes would be recommended.
We would appreciate any advice, resources or similar experiences for deployments with large 'warm' indexes and low search volumes.