Sizing for Elastic Stack in an on-prem environment

Hi Elastic Community,

I am preparing a hardware sizing for one of my customers, for log analysis, with the below requirements:

Current Data size: 400GB

Daily data ingestion: 20GB

Retention period: 30 days

So, as per the article we referenced, we have prepared the hardware sizing as below:


| Node ID | Memory (GB) | CPU Cores | JVM Heap (GB) | Disk Storage (GB) | Role | Elastic Component |
|---|---|---|---|---|---|---|
| master-01 | 16 | 4 | 8 | 100 | master | Elastic Server - Master Cluster Node |
| master-02 | 16 | 4 | 8 | 100 | master | Elastic Server - Master Cluster Node |
| master-03 | 16 | 4 | 8 | 100 | master | Elastic Server - Master Cluster Node |
| data-01 | 64 | 16 | 30 | 833 | data, ingest | Data node |
| data-02 | 64 | 16 | 30 | 833 | data, ingest | Data node - replication |
| data-03 | 64 | 16 | 30 | 833 | data, ingest | Data node |
| data-04 | 64 | 16 | 30 | 833 | data, ingest | Data node - replication |
| Kibana | 32 | 16 | 16 | 500 | Visualization | Kibana |
| Logstash | 8 | 4 | 4 | 100 | Processing the data | Logstash |

That is 3 master nodes, 2 data nodes, and 2 replica nodes. However, I would like to confirm that the above nodes are sufficient for our requirements. I would also like to know how to set up hot, warm, and cold tiers in the above cluster setup.

Is there a way we can set up MFA with the community edition?

Kindly advise on the above requests.

Thank you in advance.

Regards,

Eshwar

I’d recommend reviewing the data roles more closely, so you don’t keep all the information in the hot tier for too long and risk putting unnecessary load on the hot nodes.

Node roles | Elastic Docs

Hi @GiorgioS13 ,

Thank you for your suggestion.

I am planning to use the roles as below.

master role for 3 master nodes.

data, data_content, data_hot, and data_warm for 2 data nodes; the remaining two for replicas.
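Roughly, I am assuming the `node.roles` setting in each node's elasticsearch.yml would look like the sketch below (the exact node-to-tier mapping is only my assumption for now):

```yaml
# master-01 / master-02 / master-03 (elasticsearch.yml)
node.roles: [ master ]
```

```yaml
# data-01 / data-03 (elasticsearch.yml) - hot tier, also holds regular content indices
node.roles: [ data_content, data_hot, ingest ]
```

```yaml
# data-02 / data-04 (elasticsearch.yml) - warm tier
node.roles: [ data_warm ]
```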

Kindly suggest: can I have hot and warm tiers within the data nodes, as we want to keep a 30-day retention?

Note: we have a maximum of 2 indices in our environment.

Regards,

Eshwar

What are you looking to achieve by adopting a hot-warm architecture? Why is a simple architecture where all 4 data nodes are equal in a single tier not appropriate?

The main reason to go for a hot-warm architecture is generally to allow different nodes to have different hardware specifications. Hot nodes do all the indexing, which tends to be very I/O intensive, and therefore often have very fast local SSD disks. Warm nodes do not perform any indexing and can therefore support larger data volumes and/or use slower storage, which lowers cost. If all data nodes have the same specification, I do not see any point in trying to adopt a hot-warm architecture; it only adds complexity for no real gain. Note that you can still have ILM policies that change index settings and delete data over time even if the data is not relocated between tiers.
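As an illustration (a sketch only, the policy name and thresholds here are made up), an ILM policy can roll indices over and delete them after 30 days without ever moving data between tiers:

```json
PUT _ilm/policy/logs-rollover-delete
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

Here `min_age` in the delete phase is measured from rollover, so with daily rollover the data is kept for roughly 30 days.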


Maybe it’s just semantics, but “two for replicas” isn’t really correct terminology.

If you have 4 data nodes you have 4 data nodes. Nodes aren’t replicated, indices (actually the shards) are replicated. Queries can get data from both primary and replica shards.

Generally people see "performance" as measured primarily by query/aggregation response time. In most cases, you want your queries making optimal use of all of your hardware, specifically CPU and memory, i.e. all nodes working/cooperating in parallel.

So, as well as deciding what hardware you will buy/use, also start thinking about how to map your data to that hardware.


Thank you @Christian_Dahlqvist and @RainTown for your prompt responses.

My thought process is to distribute the data across the nodes so that, if any data node is down, I am still able to query the data from the other nodes.

And for the hot-warm tiers, I am planning to implement the following ILM policy for optimal performance (a rough draft of the policy is sketched below the list):

Hot Phase: 7 days on data nodes with fast SSD storage
Warm Phase: 23 days on standard storage (lower I/O requirements)
Delete Phase: After 30 days to maintain retention policy
Rollover: Daily indices or when reaching 50GB per index.
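
As a rough draft (the policy name and exact thresholds are only my assumptions at this stage), the policy would be something like:

```json
PUT _ilm/policy/logs-30d-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "set_priority": { "priority": 50 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```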

In case my approach is wrong, I would like to know what the best sizing for my requirement would be. I much appreciate your inputs.

Thank you

Regards,

Eshwar

In that case you would have 2 data nodes in the hot tier and 2 data nodes in the warm tier.

The hardware specification you provided above shows all data nodes having the same specification. There is no distinction between storage types for the different nodes, and the size is exactly the same across the board.

What types of storage are you expecting/planning to use for the different data nodes?

What level of flexibility do you have when it comes to allocating resources to the different node types?

If you are ingesting 20GB per day, and we assume this is also the size the data will take up on disk once indexed, the total data volume in the cluster is likely to be around 1200GB (20GB/day × 30 days × 2 copies) with 1 replica configured. That is not a lot of data, and I think the cluster seems oversized for that volume. If these numbers are correct I see no need for a hot-warm architecture. Just because you can set it up that way, and it is something often used for larger deployments, does not mean it is the right thing for your specific use case.

@Christian_Dahlqvist, Thank you for your suggestions.

We have 400 GB of data at present in a relational database, and we would like to ingest that 400 GB into Elastic at the beginning, after setting up the cluster. So the total is likely to be 2000GB with 1 replica configured, plus 25% for operations, i.e. 2500GB.

Based on the above data and as per Elasticsearch documentation, we have decided to set up a 3-node cluster.

Based on your suggestions, a hot-warm architecture may not be necessary. Can you confirm whether a 3-node cluster, where each node functions as both master and data, would be sufficient, or whether we would require dedicated data nodes as well?

We are planning to use SSD storage for the data nodes.

Regards,

Eshwar

For a small deployment, which it sounds like this is, that is exactly what I would recommend.

Wise choice.

I am not sure I understand the calculation, but for that data volume a single 3-node cluster where all nodes are master and data should work well and also be easy to administer.
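
As a sketch, the per-node configuration for that can be very simple (cluster and host names below are placeholders):

```yaml
# elasticsearch.yml - same on all three nodes except node.name
cluster.name: logs-cluster
node.name: es-node-01            # es-node-02 / es-node-03 on the other hosts
node.roles: [ master, data, ingest ]
network.host: 0.0.0.0
discovery.seed_hosts: ["es-node-01", "es-node-02", "es-node-03"]
# only used when bootstrapping the cluster for the very first time
cluster.initial_master_nodes: ["es-node-01", "es-node-02", "es-node-03"]
```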

Is this 2 TB size counting replicas already?

@Christian_Dahlqvist, Thank you for the prompt answer.

@leandrojmp, Yes, it is 2TB with replicas.

Regards,

Eshwar

@Christian_Dahlqvist, I have a little confusion: if I set up only a 3-node cluster, then the same data will be available on all the nodes (correct me if I am wrong). Hence, I would have to define 1 TB of disk on each node. So, how can I assign replicas for my index in this scenario?

Regards,

Eshwar

If you set up your indices with 1 replica shard, each of the 3 nodes will hold roughly 1/3 of the total data. All data does not need to reside on all nodes; primary and replica shards will be spread out across the nodes. For each shard, two nodes will hold the primary and the replica, while the third does not hold a copy of that shard at all. These nodes will be different for different indices.
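
For example (the index name and shard count here are just for illustration), an index created like this will have its primary shard on one node and its replica on another, which you can verify with the _cat/shards API:

```json
PUT my-logs-000001
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}

GET _cat/shards/my-logs-000001?v
```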

@Christian_Dahlqvist, Thank you for your clarification. I believe there is no impact on the data when a single node is down.

Can I consider the below hardware sizing for my cluster? Please suggest!

3 master nodes with 8 CPUs, 68 GB RAM, and 2 TB of SSD on each node.

Thank you in advance.

Regards,

Eshwar