Warm storage of large (9TB) log data archives

Hello,

We are working on a PoC to index large volumes of web log files - approx 200GB/day raw (400GB/day after indexing). Based on initial tests we think we can dedicate appropriate hardware to the 'hot' nodes which index today's logs.

We would like to keep the previous days' read-only indexes searchable for 1-2 months, moving them to lower-spec 'warm' nodes. Search volumes will be extremely low (a handful per hour), with many days' indexes not receiving any searches on a typical day. We unfortunately can't rely on snapshot/restore for these, because when a search does come in it needs to be served quickly.

Some rough stats:

  • Approx 12 daily log indexes totaling 400GB (largest index approx. 200GB).
  • The shard count per index can be tweaked, but currently ranges from 5 up to 10 for the largest index (max 20GB per shard).
  • We are using routing to route data (based on customer ID) to specific shards, and searches/aggregations will be limited to a single shard within an index.
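
For illustration, this is roughly how we index and search with routing; a minimal sketch with Python's requests library against a local cluster, where the endpoint, index name, type, and field names are simplified placeholders:

```python
import requests

ES = "http://localhost:9200"  # placeholder endpoint

# Index with an explicit routing value so every document for customer 42
# lands on the same shard of that day's index.
doc = {
    "@timestamp": "2016-01-04T12:00:00Z",
    "customer_id": "42",
    "path": "/index.html",
    "status": 200,
}
requests.post(ES + "/logs-2016-01-04/log?routing=42", json=doc).raise_for_status()

# Search with the same routing value: only the single shard holding
# customer 42's data is queried rather than every shard in the index.
query = {"query": {"term": {"customer_id": "42"}}}
resp = requests.post(ES + "/logs-2016-01-04/_search?routing=42", json=query)
print(resp.json()["hits"]["total"])
```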

To store a month's read-only logs (weekdays only), this gives us:

  • Total storage: 9TB
  • Total indexes: ~260
  • Total shards: ~1,400!

9TB of disk storage seems achievable across a small number of nodes, but we are concerned about how much memory each warm node would need to serve the occasional searches/aggregations, and about how many nodes would be recommended.

We would appreciate any advice, resources or similar experiences for deployments with large 'warm' indexes and low search volumes.

Thanks!

I've worked with similar systems. Would be interested in your indexing statistics if you can share.

In my experience, memory has been the limiting factor in keeping the indices available. This means either additional instances of ES on a machine with more memory, or multiple nodes in your cluster. When indices get that large, I've seen the segment memory grow significantly.

I would expect you to handily eat up 64GB of heap for this kind of solution, without considering disk cache. I would be interested in whether others think this is too much or too little.
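
For refining that estimate, it helps to watch segment memory against heap on the candidate warm nodes as data accumulates. A minimal sketch against the nodes stats API, assuming a cluster reachable on localhost:9200 and Python's requests library:

```python
import requests

# Per-node segment memory vs. JVM heap, from the nodes stats API.
stats = requests.get("http://localhost:9200/_nodes/stats/indices,jvm").json()

for node in stats["nodes"].values():
    seg_gb = node["indices"]["segments"]["memory_in_bytes"] / 1024.0 ** 3
    heap_gb = node["jvm"]["mem"]["heap_used_in_bytes"] / 1024.0 ** 3
    max_gb = node["jvm"]["mem"]["heap_max_in_bytes"] / 1024.0 ** 3
    print("%s: segment memory %.1f GB, heap %.1f / %.1f GB"
          % (node["name"], seg_gb, heap_gb, max_gb))
```

Tracking how that number grows as each day's index is added gives a rough segment-memory cost per TB that you can extrapolate to the full 9TB.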

You can use shard allocation filtering to move the completed daily indexes onto the warm nodes.
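
Roughly, assuming the warm boxes are tagged with a custom node attribute in elasticsearch.yml (node.attr.box_type: warm on 5.x, node.box_type: warm on 2.x; the attribute name and index name here are just examples):

```python
import requests

# Once a day's index is complete, flip its allocation filter; Elasticsearch
# then relocates the shards onto the nodes tagged box_type=warm on its own.
settings = {
    "index.routing.allocation.require.box_type": "warm",
    "index.blocks.write": True,  # optional: make the index read-only at the same time
}
requests.put(
    "http://localhost:9200/logs-2016-01-04/_settings", json=settings
).raise_for_status()
```

New daily indexes can be pinned to the hot nodes the same way through an index template, so only the filter flip is needed when a day rolls over.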

It depends on a few things:

  1. Whether you use doc values or field data. Doc values are columnar storage on disk; field data lives on the heap. You'll have a much better time using doc values.
  2. What those searches look like. Terms aggregations, especially when they are nested inside other terms aggregations, can take up quite a bit of memory.
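
To put both points in concrete terms, here is a hedged sketch: an index template forcing doc_values on the aggregated fields (2.x-era mapping syntax, where doc_values is already the default for not_analyzed strings and numerics; on 5.x+ you'd use keyword fields), followed by the nested-terms shape that tends to be memory-hungry. The template name, index pattern, and field names are assumptions:

```python
import requests

ES = "http://localhost:9200"  # placeholder endpoint

# Index template: doc_values on the fields we aggregate on, so aggregations
# read columnar data from disk instead of building field data on the heap.
template = {
    "template": "logs-*",
    "mappings": {
        "log": {
            "properties": {
                "customer_id": {"type": "string", "index": "not_analyzed",
                                "doc_values": True},
                "path": {"type": "string", "index": "not_analyzed",
                         "doc_values": True},
                "status": {"type": "integer", "doc_values": True},
                "@timestamp": {"type": "date", "doc_values": True},
            }
        }
    },
}
requests.put(ES + "/_template/warm-logs", json=template).raise_for_status()

# The memory-hungry shape: one terms aggregation nested inside another,
# creating a bucket for every path/status combination on the routed shard.
body = {
    "size": 0,
    "aggs": {
        "by_path": {
            "terms": {"field": "path", "size": 20},
            "aggs": {"by_status": {"terms": {"field": "status", "size": 5}}},
        }
    },
}
resp = requests.post(ES + "/logs-2016-01-04/_search?routing=42", json=body)
print(resp.json()["aggregations"]["by_path"]["buckets"])
```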