Warm vs. Cold (+ Frozen) for archival

First, this is in the context of using Elastic Cloud. Our use case is primarily archiving logs and metrics mostly for compliance and security forensics. We'll be using some of the metrics in the short term (daily basis, but rarely looking at data more than a few days old). We're also a pretty small environment with a small team. I.E. there won't be much search activity because there's only a couple of us that would be looking. And most of the time we have other things to do.

Specifically, in the case where there is relatively little search activity and the primary purpose is archiving the data, what are the trade-offs between configuring 1 Cold (single copy with snapshots for backups) vs. 2 Warm nodes (2 copies of data)? Things that I can think of include:

Availability: if the single cold node goes down, that data is inaccessible until it's restored. Presumably that should be rare, but are there maintenance activities that might cause this on a regular basis?

Performance: Queries could be spread across two nodes in the 2x warm scenario. But if most of our query activity will be handled on the Hot tier, this is probably a mostly moot point. And it seems that from a hardware perspective the Warm and Cold tiers are effectively the same, so performance of a single query running on one of those nodes seems like it should be similar.

Scalability: If we decide we need more "active" data (not Frozen), scaling a single copy of the data in the Cold tier can happen in smaller increments vs. replicated copies in Warm.

Am I missing something else?

Finally, is there a difference between an ILM policy that sets the Warm tier to 0 replicas but has snapshots available for all its data and a Cold tier that uses "searchable snapshot"?


No ... all our maintenance is planned with no downtime. I have a cluster with no downtime in over a year.

Yup pretty much.

Yes you can create replicas in cold... we don't see it that often but yes you can.

Not Much :

Yes, there are some subtle differences. Warm is not backed by the searchable snapshot, and thus is not as quickly resilient / good as Cold with Searchable Snapshots.

IMHO Cold with Searchable Snapshot is slightly better than Warm with 0 Replicas
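To make the comparison concrete, here is a minimal sketch of the two ILM policy bodies, written as Python dicts so they can be serialized and sent to the ILM API. The phase timings are placeholders, and the repository name `found-snapshots` is an assumption (it is the usual default on Elastic Cloud, but check your own deployment). In the Warm variant the single copy lives only on the node's disk and snapshots have to be taken separately; in the Cold variant the index is mounted from the snapshot itself.

```python
import json

# Assumption: repository name; verify it in your own deployment.
SNAPSHOT_REPO = "found-snapshots"

# Placeholder rollover settings shared by both sketches.
HOT_PHASE = {"actions": {"rollover": {"max_age": "7d", "max_primary_shard_size": "50gb"}}}

# Variant A: Warm tier with 0 replicas. Backups depend on a separate snapshot policy.
warm_zero_replicas = {
    "policy": {
        "phases": {
            "hot": HOT_PHASE,
            "warm": {
                "min_age": "7d",  # placeholder
                "actions": {"allocate": {"number_of_replicas": 0}},
            },
        }
    }
}

# Variant B: Cold tier backed by a searchable snapshot.
cold_searchable_snapshot = {
    "policy": {
        "phases": {
            "hot": HOT_PHASE,
            "cold": {
                "min_age": "7d",  # placeholder
                "actions": {
                    "searchable_snapshot": {"snapshot_repository": SNAPSHOT_REPO}
                },
            },
        }
    }
}

print(json.dumps(cold_searchable_snapshot, indent=2))
```

Either dict could be used as the request body for `PUT _ilm/policy/<name>`; the policy name is up to you.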

BTW Frozen has a pretty incredible cost to performance ratio... A frozen node supports ~90-100 TB of data and the caching is very smart... The 1st request might be a bit slow but subsequent ones can be very fast.


That wasn't exactly what I meant, but what I was thinking at the time is not necessarily correct either. I.E. if you have two Warm nodes and need more Warm capacity, you can't just increase the size of one node, you have to increase both so one doesn't run out before the other. But I believe you could add a third node. Which would make for a smaller (albeit more expensive / GB) upgrade.

I was leaning towards Cold being "better" myself, glad to hear I wasn't thinking incorrectly.

Yes, I started down this path as I was planning on adding Frozen to help with handling the volume of data that Elastic Security can generate. But that requires "Enterprise". But potentially going from 2x Warm to 1x Cold could offset at least part of that additional cost.

Hot x2 -> Cold x1 -> Frozen
is cheaper than
Hot x2 -> Warm x2 -> Frozen
and is slightly better than
Hot x2 -> Warm x1 -> Frozen

But can you do:
Hot x2 -> Frozen

If truly the majority of the queries can be satisfied from Hot, it seems like it should work? If so, having an intermediate cold/warm node only really becomes necessary if you need fast first search response?

Lost me a bit...Yes you can increase the node size until max node size and then you will need to add more nodes...

You will need to do a bit more math; it depends on how long the retention is in each tier.
I would suggest a little POC.
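A rough sketch of that retention math, assuming storage on a tier is simply daily volume times days retained times the number of copies kept on that tier. All the numbers below are hypothetical, not taken from this deployment.

```python
def tier_storage_gb(gb_per_day, days_in_tier, copies):
    """Disk needed on a tier: daily primary volume * retention days * copies kept."""
    return gb_per_day * days_in_tier * copies

# Hypothetical inputs: 5 GB/day of primary data, 7 days on Hot, 23 more days after that.
daily_gb = 5
hot_x2 = tier_storage_gb(daily_gb, days_in_tier=7, copies=2)    # primary + 1 replica
warm_x2 = tier_storage_gb(daily_gb, days_in_tier=23, copies=2)  # primary + 1 replica
cold_x1 = tier_storage_gb(daily_gb, days_in_tier=23, copies=1)  # single copy, snapshot-backed

print(f"Hot x2:  {hot_x2} GB")
print(f"Warm x2: {warm_x2} GB")
print(f"Cold x1: {cold_x1} GB")
```

With the same retention, the single-copy Cold tier needs half the disk of the replicated Warm tier, which is where the cost difference comes from.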

Yup you can do that.
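For illustration, a Hot -> Frozen policy body might look like the sketch below, again as a Python dict. The `3d` age and the `found-snapshots` repository name are assumptions chosen for the example, not recommendations.

```python
import json

# Assumptions: repository name and all timings are placeholders.
hot_to_frozen = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_age": "1d", "max_primary_shard_size": "50gb"}
                }
            },
            "frozen": {
                "min_age": "3d",
                "actions": {
                    "searchable_snapshot": {"snapshot_repository": "found-snapshots"}
                },
            },
        }
    }
}

print(json.dumps(hot_to_frozen, indent=2))
```

There is no warm or cold phase at all here; data goes straight from Hot to the Frozen tier once it passes the configured age.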

Cold with searchable snapshots requires Enterprise as well.

I consider us still in the middle of the POC right now. But it's not "little": there are a lot of complicating factors, and the capacity planning can't really be done until you have everything sending data in, because if you aren't already using Elastic, there's no way to know how much data you might be recording in Elastic.

I've found that to be especially true for Elastic Security which can send significantly different volumes of data depending on the activity on the system it's installed on. Getting multiple GBs of data / day from a system that's mostly idle was quite surprising. I was able to cut down on some of that noise, but it's still not insignificant.

I was able to extract the datastream sizes today though and thought it was interesting that Elastic Security and "es" (whatever .monitoring-es-8-mb is, apparently internal logging of some type?) are by far the largest components.


Not all of those data streams have the same number of days of data. And there are still some systems to add.

But the good news is that those top ones are not likely things that we'll ever access unless there's a specific problem so Frozen would seem to be perfect for anything older than a few days. Aside: it would be nice if the cloud UI would show the total storage by data stream and (ideally) by tier.

I do appreciate your responses!

I think we are working on some easier ways to see your ingest rates....

"Little" hehehe everything is relative ... I meant short :slight_smile:

BTW my last POC ~2TB / Day (Prod about ~10TB / Day) ...

Looks like you are on your way to .3-.5 TB / day or so... nice!

And yes planning ILM / Data Tiers is an important part...

Oh this is nowhere near that large... the oldest of those streams goes back to March 17 so a little over a month. Unless I misunderstand the data, that's the total storage for the stream, not just today's data.

Yup... Total (depending on what command you ran... primary / total)

What I do ...

Total Primary Size / Number of Docs = Avg Bytes / Doc

Then estimate number of Docs / Day

Total Primary / Day = Number of Docs / Day * Avg Doc Size

Or Events per Sec * Avg Size * 86400 = Total Primary / Day

However you like to do the math...

Total Storage = Primary Storage + Replica Storage
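The formulas above can be sketched as a few small functions. The function names and the sample numbers are made up for illustration; plug in your own primary size and doc counts.

```python
SECONDS_PER_DAY = 86_400

def avg_bytes_per_doc(total_primary_bytes, doc_count):
    """Total Primary Size / Number of Docs = Avg Bytes / Doc."""
    return total_primary_bytes / doc_count

def primary_bytes_per_day(docs_per_day, avg_doc_bytes):
    """Total Primary / Day = Number of Docs / Day * Avg Doc Size."""
    return docs_per_day * avg_doc_bytes

def primary_bytes_per_day_from_eps(events_per_sec, avg_doc_bytes):
    """Events per Sec * Avg Size * 86400 = Total Primary / Day."""
    return events_per_sec * avg_doc_bytes * SECONDS_PER_DAY

def total_storage_bytes(primary_bytes, replicas=1):
    """Total Storage = Primary Storage + Replica Storage."""
    return primary_bytes * (1 + replicas)

# Hypothetical sample: 40 GB of primary data holding 80 million docs.
avg = avg_bytes_per_doc(40_000_000_000, 80_000_000)
per_day = primary_bytes_per_day(2_000_000, avg)
print(f"avg doc size: {avg:.0f} bytes")
print(f"primary per day: {per_day / 1e9:.2f} GB")
print(f"with 1 replica: {total_storage_bytes(per_day) / 1e9:.2f} GB")
```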

I think I prefer to be simpler and more direct: GB / days active = GB/day. Then scale up based on "I've only deployed this to 8 of 12 servers".

Both are complicated by different servers generating different amounts of data, but at the moment, I'm just doing rough budgeting numbers, with the understanding that there are details that we surely have yet to discover.
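As a sketch of that simpler estimate: divide a stream's total size by the days it has been active, then scale linearly by the share of servers deployed so far. The linear scaling assumes the remaining servers look roughly like the ones already reporting, which may well not hold, and the sample numbers are hypothetical.

```python
def gb_per_day(stream_gb, days_active):
    """GB / days active = GB/day."""
    return stream_gb / days_active

def scale_to_full_fleet(observed_gb_per_day, deployed_servers, total_servers):
    """Scale the observed rate up to the full fleet, assuming similar servers."""
    return observed_gb_per_day * total_servers / deployed_servers

# Hypothetical: 120 GB collected over 30 days from 8 of 12 servers.
observed = gb_per_day(120, 30)
projected = scale_to_full_fleet(observed, deployed_servers=8, total_servers=12)
print(f"observed:  {observed:.1f} GB/day")
print(f"projected: {projected:.1f} GB/day")
```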

I'm sure from a "real" Elastic capacity and performance perspective the details of the number and size of documents / day (vs. just total GB/day) are important, but for somebody just starting out with Elastic at moderate ingest levels, my assumption is that it's a moot point. I.e. I configured for "general purpose" and figured I should be covered. But configuring as "storage optimized" may be more appropriate and would save significantly on the hot tier (presumably because the RAM and vCPU for "storage optimized" is 1/2 as much as "general purpose"). So there's something to be said for trying to understand the requirements there. But in my particular case, looking at the performance panel, I rather think "general purpose" was probably the right choice. Although there may be a good argument for making it "storage optimized" but doubling up the storage.

I've spent some time in the pricer and done rough estimates of the expected GB/day based on what I've seen so far. Without Elastic Security, we would probably be fine on Premium with a Hot x2 + Warm x2 for probably <$6K/year. With Security, that would at least double. Or we could use Enterprise with Hot x2 + Frozen x1 for about the same price and get all the benefits from Elastic Security. That kind of sounds like a deal. If Frozen turns out to be too slow for our regular queries for non-Security data, it appears we could go to "storage optimized" and keep that data 2x as long on Hot for about the same price.

So I think I have a plan. And reconfiguring to do.

Thanks for all your help!
