What are the requirements?

Hi

If I have 1500 devices and each device uploads about 3 GB of data per day,
how many nodes do I need? And what should their specifications be, such as RAM, disk, and CPU?

What kind of devices?

Desktops/Laptops? Servers? Network devices?

How did you arrive at this number of 3 GB per device per day?

For how long do you want to keep your data?

All of those things will influence the sizing of the cluster, but the best way to find out is still by testing with a proof-of-concept cluster.

Device type: Ubuntu VMs, running on a server.

It reaches 3 GB because each device uploads data continuously, plus logs.

I want to store data for a year.

What is the appropriate number of nodes? And how much space is needed to store this data for a year?

Am I understanding correctly that you collected data from one device for one day and that it resulted in 3GB of data on disk in Elasticsearch with the standard mappings? Is this on a single node with only primary shard(s) or does the 3GB include one or more replicas?

Another aspect worth considering is how quickly you need access to this data. A lot of users need immediate and fast access to the last 14 or 30 days worth of data, but are willing to experience/suffer longer search latencies or delays when accessing older data. This allows you to set up a tiered cluster that stores data more efficiently. What are your requirements around access to the older data?

Yes, I collect data from many devices, and I took one device as a sample.
The data is on one node, with one shard and one replica.

In order to help you with sizing a cluster you will need to be more precise.

  • How many devices did you collect data from?
  • What time period did these cover?
  • What is the exact size on disk?
  • How did you arrive at the estimate of 3GB of data per device and day?
  • Did you collect all the different types of data you expect to collect when in production?

If you only have a single node, Elasticsearch cannot allocate any replica shards, so the size corresponds to primary shard(s) only.
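If it helps, here is a minimal sketch of how you could check this yourself with the official Python client (the index pattern `devices-*` is just a placeholder for your own index names):

```python
# A minimal sketch using the official Python client (pip install elasticsearch).
# The pattern "devices-*" is a placeholder for your own index names.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# pri.store.size = primary shards only; store.size = primaries + replicas
rows = es.cat.indices(index="devices-*", format="json", bytes="gb",
                      h="index,pri.store.size,store.size")
for row in rows:
    print(f"{row['index']}: {row['pri.store.size']} GB primary, "
          f"{row['store.size']} GB total")
```

If the two numbers are identical, you are looking at primary data only.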

What about these questions?

Currently I am collecting data from 20 devices, and I want to conduct a study in order to expand to 1500 devices.
Out of the 20 devices, I took only one device, which represents the average.
I want the study to assume 1500 devices, with each device uploading 3 GB, although in truth some devices upload more than 3 GB and some upload less.

I want the study to cover a year.
But if you mean the 3 GB, that is for one day only.

We can assume that each device will generate 3GB of primary index data on disk per day. I just wanted to make sure that you are estimating this correctly as it will have a huge impact.

A year is a long time to store data, so the questions around access to the older data are critical. Please provide information about this.

The exact size varies: some devices produce up to 6 GB and some as little as 900 MB, but I take the average.
You could say the exact size is just slightly under 3 GB.

I found this by observing Elasticsearch over one day.

Yes.

Yes, I was about to say that to you.

I want it to be as fast as possible.

I have a policy for storing data that is limited to 14 days, after which the data is moved to another index.

But that means the data is still searchable and I need that to be as quick as possible.

Yes, I appreciate that very well.

What information exactly do you want me to provide you?

OK. Let's assume you are collecting data from 1500 devices and that each device generates 3GB of data on disk (without replicas) per day. That is a total of 4.5TB of new data per day that goes into the cluster. I suspect you want some resiliency, so I will assume each index has 1 replica configured by default. This means that one day's worth of data results in 9TB of data on disk. For a retention period of 1 year that equals a total of 1.65PB of primary shards and 3.3PB of total data size.

If you want this to all be accessible with low latency you are probably looking at nodes similar to the Storage Optimized (dense) tier on Elastic Cloud. The largest nodes there have 8 CPU cores, 60GB RAM and 4.7TB of fast SSD storage. If you only used this type of node you would need approximately 700 nodes, which would be quite expensive.
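To make the arithmetic explicit, here is a back-of-the-envelope sketch (the 3GB/device/day figure and the 4.7TB node size are the assumptions from above):

```python
# Back-of-the-envelope sizing using the assumptions discussed above.
devices = 1500
gb_per_device_per_day = 3
replicas = 1
retention_days = 365
node_storage_tb = 4.7  # Elastic Cloud storage-optimized (dense) node

daily_primary_tb = devices * gb_per_device_per_day / 1000      # 4.5 TB/day
daily_total_tb = daily_primary_tb * (1 + replicas)             # 9.0 TB/day
yearly_primary_pb = daily_primary_tb * retention_days / 1000   # ~1.65 PB
yearly_total_pb = daily_total_tb * retention_days / 1000       # ~3.3 PB
nodes = yearly_total_pb * 1000 / node_storage_tb               # ~700 nodes

print(f"{daily_total_tb:.1f} TB/day on disk, {yearly_total_pb:.2f} PB/year, "
      f"~{nodes:.0f} nodes")
```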

If you are deploying on-premises or on your own in the cloud you could have nodes with a different profile. If you choose to go with more storage per node and/or slower storage in order to save cost, you would trade query performance for higher node density. Exactly how far you can go will depend on your latency requirements, so this is something you would need to test.

What most users with a long retention period do to get around this is use a tiered cluster approach, where the cluster is divided into different zones with different node and performance characteristics. These are often referred to as hot, warm, cold and frozen tiers.

The hot nodes are similar in characteristics to the Elastic Cloud nodes I mentioned earlier. These nodes hold all the most recent indices and therefore handle all indexing load. This is typically very I/O intensive and requires local SSD storage performance. A common assumption is also that the most recent data is the most frequently queried, so it is the data that needs to be served with the lowest latency to the most users. These nodes tend to hold maybe only 7 or 14 days' worth of data. All indices on this tier are deployed with a replica shard for resiliency. Once data is older than the threshold it is migrated over to the warm set of nodes.

Warm tier nodes are likely similar to the hot nodes in terms of CPU and RAM, but as they do not handle any indexing they can often use larger volumes of somewhat slower disks. This tier may hold perhaps one or two months' worth of data. Once data is older than the threshold it is moved to the cold tier.

The cold tier relies on searchable snapshots, which is a feature that requires a commercial license. In this tier only the primary shard is deployed on the nodes, and resiliency is handled through a snapshot of the shards stored on e.g. S3. These nodes will also hold data for a specific time period, and once the data exceeds the threshold it is moved to the frozen tier. It is possible to skip the warm tier and move data directly to the cold tier to save cost.

The frozen tier works with all indices stored in S3 and only parts of the data cached on the nodes. This type of node can handle very large amounts of data (hundreds of TB per node), which makes it very cost effective in a scenario like yours.
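To illustrate how data moves through such tiers, here is a rough sketch of an index lifecycle management (ILM) policy. The policy name, phase ages and the snapshot repository "my-repo" are placeholders you would adapt, and as noted above the searchable_snapshot action in the cold and frozen phases requires a commercial license:

```python
# A rough ILM policy sketch for a hot -> warm -> cold -> frozen lifecycle.
# Names, ages and the snapshot repository "my-repo" are placeholders, and
# the searchable_snapshot action requires a commercial license.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.ilm.put_lifecycle(
    name="device-logs-policy",
    policy={
        "phases": {
            "hot": {
                "actions": {
                    # Roll over to a new backing index daily or at 50GB/shard
                    "rollover": {"max_age": "1d",
                                 "max_primary_shard_size": "50gb"}
                }
            },
            "warm": {
                "min_age": "14d",  # leave the hot tier after 14 days
                "actions": {"forcemerge": {"max_num_segments": 1}}
            },
            "cold": {
                "min_age": "60d",  # ~2 months, then searchable snapshots
                "actions": {
                    "searchable_snapshot": {"snapshot_repository": "my-repo"}
                }
            },
            "frozen": {
                "min_age": "90d",
                "actions": {
                    "searchable_snapshot": {"snapshot_repository": "my-repo"}
                }
            },
            "delete": {
                "min_age": "365d",  # one year total retention
                "actions": {"delete": {}}
            }
        }
    },
)
```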

I will leave sizing of the cold and frozen tiers to someone at Elastic, as it is a commercial feature I do not have hands-on experience with.

Storing 14 days' worth of data on hot-tier-style nodes like the ones described above would require 28 nodes. It may be worth skipping the warm tier and moving data directly to the cold tier. If we assume these nodes hold twice the amount of data the hot nodes do and that no replica is required, this tier would need 15 nodes to hold 30 days' worth of data. You then need to add some frozen nodes to the cluster, but I am not sure how many would be suitable.
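Spelling out that arithmetic (same assumptions as above; the exact division gives 27 hot nodes, and 28 simply leaves a little headroom):

```python
import math

# Node counts for the tiered layout described above.
daily_primary_tb = 4.5
hot_node_tb = 4.7               # hot node usable storage (as above)
cold_node_tb = 2 * hot_node_tb  # assumption: cold nodes hold twice as much

hot_tb = 14 * daily_primary_tb * 2   # 126 TB: 14 days incl. 1 replica
cold_tb = 30 * daily_primary_tb      # 135 TB: 30 days, primaries only

print("hot nodes:", math.ceil(hot_tb / hot_node_tb))    # 27 (28 with headroom)
print("cold nodes:", math.ceil(cold_tb / cold_node_tb)) # 15
```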

As you can see, a tiered cluster like the one described here can save you a lot of hardware cost, and you can adapt the node specifications and the time data spends in each tier to best suit your requirements. You would, however, need to test how a tiered approach measures up to your query latency requirements, and contact Elastic to size it properly if you want to use searchable snapshots.

You may also reduce size and cost by storing different types of data for different periods of time, or by having them transition tiers at different intervals.


Okay
Thank you very much for all this information, I will follow it carefully

Also, I have another question.

Suppose I store the last 14 days of data on SSD storage and then migrate the data to HDD storage.

Can you tell me how this operation works?

For example, if I store data on an SSD, how do I migrate that data to an HDD?

You would move data between different types of nodes using ILM in a hot-warm architecture (this can be hot-warm-cold-frozen, hot-cold-frozen, etc., as described earlier).
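As a sketch of how that works under the hood: each node declares its tier via node.roles in elasticsearch.yml, and ILM relocates shards between tiers by rewriting an index-level tier preference setting. For illustration (the index name "devices-000001" and the role assignments are examples):

```python
# Sketch: nodes declare their tier in elasticsearch.yml, for example
#   SSD nodes:  node.roles: ["data_hot", "data_content"]
#   HDD nodes:  node.roles: ["data_warm"]
# ILM then relocates shards by rewriting the index's tier preference.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# What ILM effectively does when an index enters the warm phase
# (index name "devices-000001" is a placeholder):
es.indices.put_settings(
    index="devices-000001",
    settings={
        # Prefer warm (HDD) nodes, fall back to hot (SSD) if none exist
        "index.routing.allocation.include._tier_preference": "data_warm,data_hot"
    },
)
```

You do not normally run this by hand; attaching the ILM policy to the index template makes the transition happen automatically.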

Do I understand from your words that I cannot transfer data from one storage type to another?

Well, if you make 14 nodes with SSD storage and another 14 with HDD storage,
then put the hot tier on the SSD nodes and the cold tier on the HDD nodes.

Does this help? Does it have side effects?