Warm storage of large (9TB) log data archives

Hello,

We are working on a PoC to index large volumes of web log files - approx 200GB/day raw (400GB/day after indexing). Based on initial tests we think we can dedicate appropriate hardware to the 'hot' nodes which index today's logs.

We would like to keep the previous days' read-only indexes searchable for 1-2 months, moving them to lower-spec 'warm' nodes. Search volumes will be extremely low (a handful per hour), with many days' indexes not receiving any searches on a typical day. We unfortunately can't rely on snapshot/restore for these, because when a search does come in it needs to be served quickly.

Some rough stats:

  • Approx 12 daily log indexes totaling 400GB (largest index approx. 200GB).
  • The shard count per index can be tweaked, but currently ranges from 5 up to 10 for the largest index (max 20GB per shard).
  • We are using routing to route data (based on customer ID) to specific shards, and searches/aggregations will be limited to a single shard within an index.
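
For illustration, this is roughly how we index and search with routing; a minimal sketch with Python's requests library against a local cluster, where the endpoint, index name, type, and field names are simplified placeholders:

```python
import requests

ES = "http://localhost:9200"  # placeholder endpoint

# Index with an explicit routing value so every document for customer 42
# lands on the same shard of that day's index.
doc = {
    "@timestamp": "2016-01-04T12:00:00Z",
    "customer_id": "42",
    "path": "/index.html",
    "status": 200,
}
requests.post(ES + "/logs-2016-01-04/log?routing=42", json=doc).raise_for_status()

# Search with the same routing value: only the single shard holding
# customer 42's data is queried rather than every shard in the index.
query = {"query": {"term": {"customer_id": "42"}}}
resp = requests.post(ES + "/logs-2016-01-04/_search?routing=42", json=query)
print(resp.json()["hits"]["total"])
```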

To store a month's read-only logs (weekdays only), this gives us:

  • Total storage: 9TB
  • Total indexes: ~260
  • Total shards: ~1,400!

9TB of disk storage seems achievable across a small number of nodes, but we are concerned about how much memory each warm node would need to serve the occasional searches/aggregations, and about how many nodes would be recommended.

We would appreciate any advice, resources or similar experiences for deployments with large 'warm' indexes and low search volumes.

Thanks!

I've worked with similar systems. Would be interested in your indexing statistics if you can share.

In my experience, memory has been the limiting factor in keeping the indices available. This means either additional instances of ES on a machine with more memory, or multiple nodes in your cluster. When indices get that large, I've seen the segment memory grow significantly.

I would expect you to handily eat up 64GB of heap for this kind of solution, without considering disk cache. I would be interested in whether others think this is too much or too little.
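
For refining that estimate, it helps to watch segment memory against heap on the candidate warm nodes as data accumulates. A minimal sketch against the nodes stats API, assuming a cluster reachable on localhost:9200 and Python's requests library:

```python
import requests

# Per-node segment memory vs. JVM heap, from the nodes stats API.
stats = requests.get("http://localhost:9200/_nodes/stats/indices,jvm").json()

for node in stats["nodes"].values():
    seg_gb = node["indices"]["segments"]["memory_in_bytes"] / 1024.0 ** 3
    heap_gb = node["jvm"]["mem"]["heap_used_in_bytes"] / 1024.0 ** 3
    max_gb = node["jvm"]["mem"]["heap_max_in_bytes"] / 1024.0 ** 3
    print("%s: segment memory %.1f GB, heap %.1f / %.1f GB"
          % (node["name"], seg_gb, heap_gb, max_gb))
```

Tracking how that number grows as each day's index is added gives a rough segment-memory cost per TB that you can extrapolate to the full 9TB.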

You can use shard allocation filtering to move the completed daily indexes onto the warm nodes.
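
Roughly, assuming the warm boxes are tagged with a custom node attribute in elasticsearch.yml (node.attr.box_type: warm on 5.x, node.box_type: warm on 2.x; the attribute name and index name here are just examples):

```python
import requests

# Once a day's index is complete, flip its allocation filter; Elasticsearch
# then relocates the shards onto the nodes tagged box_type=warm on its own.
settings = {
    "index.routing.allocation.require.box_type": "warm",
    "index.blocks.write": True,  # optional: make the index read-only at the same time
}
requests.put(
    "http://localhost:9200/logs-2016-01-04/_settings", json=settings
).raise_for_status()
```

New daily indexes can be pinned to the hot nodes the same way through an index template, so only the filter flip is needed when a day rolls over.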

It depends on a few things:

  1. Whether you use doc values or field data. Doc values are columnar storage on disk; field data lives on the heap. You'll have a much better time using doc values.
  2. What those searches look like. Terms aggregations, especially when they are nested inside other terms aggregations, can take up quite a bit of memory.
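
To put both points in concrete terms, here is a hedged sketch: an index template forcing doc_values on the aggregated fields (2.x-era mapping syntax, where doc_values is already the default for not_analyzed strings and numerics; on 5.x+ you'd use keyword fields), followed by the nested-terms shape that tends to be memory-hungry. The template name, index pattern, and field names are assumptions:

```python
import requests

ES = "http://localhost:9200"  # placeholder endpoint

# Index template: doc_values on the fields we aggregate on, so aggregations
# read columnar data from disk instead of building field data on the heap.
template = {
    "template": "logs-*",
    "mappings": {
        "log": {
            "properties": {
                "customer_id": {"type": "string", "index": "not_analyzed",
                                "doc_values": True},
                "path": {"type": "string", "index": "not_analyzed",
                         "doc_values": True},
                "status": {"type": "integer", "doc_values": True},
                "@timestamp": {"type": "date", "doc_values": True},
            }
        }
    },
}
requests.put(ES + "/_template/warm-logs", json=template).raise_for_status()

# The memory-hungry shape: one terms aggregation nested inside another,
# creating a bucket for every path/status combination on the routed shard.
body = {
    "size": 0,
    "aggs": {
        "by_path": {
            "terms": {"field": "path", "size": 20},
            "aggs": {"by_status": {"terms": {"field": "status", "size": 5}}},
        }
    },
}
resp = requests.post(ES + "/logs-2016-01-04/_search?routing=42", json=body)
print(resp.json()["aggregations"]["by_path"]["buckets"])
```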