I am preparing hardware sizing for one of my customers for log analysis, with the following requirements:

- Current data size: 400 GB
- Daily data ingestion: 20 GB
- Retention period: 30 days

Based on the referenced article, we have prepared the hardware sizing below:
| Node ID   | Memory (GB) | CPU Cores | JVM Heap (GB) | Disk Storage (GB) | Role                | Elastic Component                    |
|-----------|-------------|-----------|---------------|-------------------|---------------------|--------------------------------------|
| master-01 | 16          | 4         | 8             | 100               | master              | Elastic Server - Master Cluster Node |
| master-02 | 16          | 4         | 8             | 100               | master              | Elastic Server - Master Cluster Node |
| master-03 | 16          | 4         | 8             | 100               | master              | Elastic Server - Master Cluster Node |
| data-01   | 64          | 16        | 30            | 833               | data, ingest        | Data node                            |
| data-02   | 64          | 16        | 30            | 833               | data, ingest        | Data node - replication              |
| data-03   | 64          | 16        | 30            | 833               | data, ingest        | Data node                            |
| data-04   | 64          | 16        | 30            | 833               | data, ingest        | Data node - replication              |
| Kibana    | 32          | 16        | 16            | 500               | Visualization       | Kibana                               |
| Logstash  | 8           | 4         | 4             | 100               | Processing the data | Logstash                             |
That is 3 master nodes, 2 data nodes and 2 replica nodes. However, I would like to confirm whether the above nodes are sufficient for our requirements. I would also like to know how to set up hot, warm and cold tiers in the above cluster.
Is there a way we can set up MFA with the community edition?
I’d recommend reviewing the data roles more closely, so you don’t keep all the information in the hot tier for too long and risk putting unnecessary load on the hot nodes.
What are you looking to achieve by adopting a hot-warm architecture? Why is a simple architecture where all 4 data nodes are equal in a single tier not appropriate?
The main reason to go for a hot-warm architecture is generally to allow different nodes to have different hardware specifications. Hot nodes do all the indexing, which tends to be very I/O intensive, so they often have very fast local SSD disks. Warm nodes do not perform any indexing and can therefore either support larger data volumes and/or use slower storage, which lowers cost. If all data nodes have the same specification I do not see any point in trying to adopt a hot-warm architecture; it only adds complexity for no real gain. Note that you can still have ILM policies that change index settings and delete data over time even if the data is not relocated between tiers.
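For reference, if you do go down that route, the tiers are defined by assigning different `node.roles` in each data node's `elasticsearch.yml` (this assumes Elasticsearch 7.10 or later, where data tier roles exist; the node names below are just examples):

```yaml
# Hot-tier data node (e.g. data-01 / data-02): fast local SSDs, handles all indexing
node.name: data-01
node.roles: [ data_hot, data_content, ingest ]

# Warm-tier data node (e.g. data-03 / data-04): larger or slower storage, holds older indices
node.name: data-03
node.roles: [ data_warm, data_content ]
```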
Maybe it's just semantics, but "2 replica nodes" isn't really correct terminology.
If you have 4 data nodes you have 4 data nodes. Nodes aren’t replicated, indices (actually the shards) are replicated. Queries can get data from both primary and replica shards.
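As an illustration, the replica count is a per-index setting rather than something configured on a node (the index name and shard counts below are just examples):

```
PUT logs-000001
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}
```

With 1 replica, each shard has a primary copy on one node and a replica copy on another, and searches can be served by either copy.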
Generally people see "performance" as measured primarily by query/aggregation response time. In most cases, you want your queries making optimal use of all of your hardware, specifically CPU and memory, i.e. all nodes working/cooperating in parallel.
So, as well as deciding what hardware you will buy/use, also start thinking about how to map your data to that hardware.
My thought process is to distribute the data across the nodes so that if any data node goes down, I can still query the data from the other nodes.
For the hot-warm tiers, we are planning to implement the following ILM policy for optimal performance (a sketch of such a policy is shown after this list):
- Hot phase: 7 days on data nodes with fast SSD storage
- Warm phase: 23 days on standard storage (lower I/O requirements)
- Delete phase: after 30 days, to enforce the retention policy
- Rollover: daily, or when an index reaches 50 GB
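Roughly, the policy we have in mind would look something like the sketch below (the policy name is just a placeholder; this assumes the data nodes use the data_hot/data_warm tier roles, so ILM moves indices to the warm tier automatically when they enter the warm phase):

```
PUT _ilm/policy/logs-ilm-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": { "max_age": "1d", "max_size": "50gb" },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```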
If my approach is wrong, I would like to know what the best sizing for my requirement would be. Your inputs are much appreciated.
In that case you would have 2 data nodes in the hot tier and 2 data nodes in the warm tier.
The hardware specification you provided above shows all data nodes having the same specification. There is no distinction between storage types for the different nodes, and the size is exactly the same across the board.
What types of storage are you expecting/planning to use for the different data nodes?
What level of flexibility do you have when it comes to allocating resources to the different node types?
If you are ingesting 20GB per day and we assume this is also the size the data will take up on disk once indexed, the total data volume in the cluster is likely to be around 1200GB with 1 replica configured (20GB/day × 30 days × 2 copies). That is not a lot of data and I think the cluster seems oversized for that volume. If these numbers are correct I see no need for a hot-warm architecture. Just because you can set it up that way, and that is something often used for larger deployments, does not mean it is the right thing for your specific use case.
We have 400 GB of data at present in a relational database and we would like to ingest that 400GB into Elasticsearch at the beginning, right after setting up the cluster. With 1 replica configured that is likely to be around 2000GB (400GB initial load + 600GB over 30 days, times 2 copies), plus 25% for operations, i.e. about 2500GB.
Based on the above data and as per Elasticsearch documentation, we have decided to set up a 3-node cluster.
Based on your suggestions, a hot-warm architecture may not be necessary. Can you confirm whether a 3-node cluster, where each node functions as both master and data node, would be sufficient, or whether dedicated data nodes would be required as well?
We are planning to use SSD storage for the data nodes.
For a small deployment, which it sounds like this is, that is exactly what I would recommend.
Wise choice.
I am not sure I understand the calculation, but for that data volume a single 3-node cluster where all nodes are master and data should work well and also be easy to administer.
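A minimal sketch of what each node's `elasticsearch.yml` could look like in that layout (cluster and node names are just examples):

```yaml
# elasticsearch.yml on each of the three nodes (use node-02 / node-03 on the other hosts)
cluster.name: logs-cluster
node.name: node-01
node.roles: [ master, data, ingest ]
discovery.seed_hosts: ["node-01", "node-02", "node-03"]
# Only needed the very first time the cluster is bootstrapped
cluster.initial_master_nodes: ["node-01", "node-02", "node-03"]
```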
@Christian_Dahlqvist, I am a little confused: if I set up only a 3-node cluster, will the same data be available on all the nodes? Correct me if I am wrong. In that case I would have to provision 1 TB of disk on each node. So how can I assign replicas for my index in this scenario?
If you set up your indices with 1 replica shard, each of the 3 nodes will hold roughly one third of the total data. All data does not need to reside on all nodes; primary and replica shards will be spread out across the nodes. For each index, two nodes will hold the primary and the replica while the third does not hold a copy of that shard at all. These nodes will be different for different indices.
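If you want to see how the primary and replica shards actually end up distributed, you can run something like this in Kibana Dev Tools (the index pattern is just an example); the prirep column shows p for primaries and r for replicas:

```
GET _cat/shards/logs-*?v&h=index,shard,prirep,node
```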