Advice on Elasticsearch Architecture design

Hi all,

I am planning to set up an Elastic Stack with different types of nodes. Traffic will be low at first, but it may increase a lot within a few months, as many applications will be pushing a lot of data.

My plan was to set up 3 nodes that hold data and also act as master nodes, with 2 additional coordinating nodes in front of them. The idea is that when there is a lot of traffic on the cluster, the data nodes are not overwhelmed with requests. Each server has 2 CPUs and 4 GB of RAM, and the Java heap size is 2 GB on each node.

I know this is not the right setup for what I am looking for. I would like a suggestion for a solid architecture: which nodes require more resources, and how to make the best use of the resources and architecture for performance. Any good advice is really appreciated.

Thanks

Hi

You should be more concrete about your data: how much is "a lot of data"?
There are some rules of thumb. For example, 1 GB of heap should hold no more than 20 shards, no matter how big the shards are (remember that an index is one or more shards, and each replica adds as many shards as the index has primaries). If you keep monitoring indices inside the same cluster (not recommended for production environments), you will have several extra indices just for monitoring.
For hot nodes, aim for a 30 to 1 ratio of disk space to RAM: for every 30 GB of disk used you need 1 GB of RAM (RAM, not heap).
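
As a rough illustration of the shards-per-heap rule, here is a minimal Python sketch. The function names are made up for this example, and the figures (3 nodes with 2 GB of heap each, 1 primary and 1 replica per index) are assumptions taken from the original question, not measured values.

```python
def shard_budget(heap_gb, nodes=1, shards_per_heap_gb=20):
    """Upper bound on shards under the '20 shards per 1 GB of heap' rule of thumb."""
    return int(heap_gb * shards_per_heap_gb) * nodes


def shard_count(indices, primaries=1, replicas=1):
    """Total shards: every replica adds as many shards as the index has primaries."""
    return indices * primaries * (1 + replicas)


# Assumed cluster from the question: 3 data nodes, 2 GB of heap each.
budget = shard_budget(heap_gb=2, nodes=3)

# How many 1-primary / 1-replica indices fit in that budget?
max_indices = budget // shard_count(1)

print(budget, max_indices)  # 120 60
```

Monitoring indices, if kept in the same cluster, count against the same budget.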

Hi Nahiko,

The data coming in is time-based: metrics and logs. Each index has 1 primary and 1 replica shard, which is the default. I am not sure if these are the right parameters. For now we have 2 applications pushing data, and the number of applications can increase to 50 or more. Each application pushes to its own index, and each index can take up to 10-20 GB per day. A new index is created each day, and the indices are currently managed by Curator, which removes indices after 15 days. I need to increase the retention period to 30 days.

I have increased the VM size to 4 CPUs and 8 GB of RAM and the heap size to 4 GB. The plan is to add additional data nodes to the existing cluster if resources run low.

I need advice on how to design the ELK architecture: index management (should I keep the same approach or use ILM?), VM size, and the types of nodes to use.

Thanks

Hi!

The number of replicas is up to you; there is no single correct default. Replicas generally improve query performance (not always, but in general), and if you temporarily lose an Elasticsearch node, your queries will still return all their results. Without replicas, after a node failure you could get all the results, some of them, or none of them, depending on which node holds the data.
Also, if you lose a disk on a node, without replicas you lose the data that was on that disk; with a replica, Elasticsearch would in most cases recover everything. HOWEVER, REPLICAS ARE NOT A BACKUP SYSTEM; snapshots are.
Having one replica takes exactly double the disk space.
Now that you have this information, it is up to you to decide.
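
To put numbers on the cost of a replica, here is a small sketch of the storage and shard arithmetic. It assumes the figures from the question (daily indices of 10-20 GB per application, 1 primary shard, 30 days of retention) and assumes that the 10-20 GB refers to primary data only; the function names are hypothetical.

```python
def retained_storage_gb(apps, gb_per_day, retention_days, replicas=1):
    """Disk needed for all retained indices, including replica copies."""
    return apps * gb_per_day * retention_days * (1 + replicas)


def retained_shards(apps, retention_days, primaries=1, replicas=1):
    """Shards kept in the cluster with one index per application per day."""
    return apps * retention_days * primaries * (1 + replicas)


# 50 applications, 30 days of retention, 1 replica.
low = retained_storage_gb(50, 10, 30)   # 10 GB/day per application
high = retained_storage_gb(50, 20, 30)  # 20 GB/day per application
shards = retained_shards(50, 30)

print(low, high, shards)  # 30000 60000 3000
```

Dropping the replica would halve both the disk figure and the shard count, at the price of the availability issues described above.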

The recommended size for a shard (an index is one or more shards) is between 20 and 40 GB. As your indices (shards) will be 10-20 GB, maybe you should create an index every two or three days, or do something else to get shards as close to 40 GB as you can. Do not create shards that hold very little data, as you will be wasting RAM. You can also put data from several applications in the same index, instead of one index per application.
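
A quick way to estimate how many days of data one index should cover is sketched below. It assumes a 40 GB target per shard, a steady daily volume, and that the daily figure is primary data only; `days_per_index` is a name invented for this example.

```python
import math


def days_per_index(gb_per_day, target_shard_gb=40, primaries=1):
    """Whole days of data an index can hold before its shards pass the target size."""
    return max(1, math.floor(target_shard_gb * primaries / gb_per_day))


for daily_gb in (10, 15, 20):
    print(daily_gb, days_per_index(daily_gb))
# 10 4
# 15 2
# 20 2
```

Real volumes vary from day to day, so treat the result as a starting point and check actual shard sizes after a few rotations.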

As for Curator versus ILM, there is no best practice here. ILM is integrated with ELK and has a graphical interface, so it is easier to use, and it does not depend on external scripts that could fail to run. Again, it is up to you.
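
If you go with ILM, a policy is just a JSON document. The sketch below builds one in Python that rolls an index over at about 40 GB or after 3 days and deletes it 30 days later. The thresholds are assumptions based on the numbers in this thread, `build_ilm_policy` is a hypothetical helper, and available rollover conditions differ between Elasticsearch versions, so check the documentation for yours.

```python
import json


def build_ilm_policy(max_size="40gb", max_age="3d", delete_after="30d"):
    """Return an ILM policy body with a hot (rollover) and a delete phase."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        # Roll over when either condition is met first.
                        "rollover": {"max_size": max_size, "max_age": max_age}
                    }
                },
                "delete": {
                    # With rollover in use, this age is counted from rollover time.
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


print(json.dumps(build_ilm_policy(), indent=2))
```

The printed body could be sent with `PUT _ilm/policy/<policy name>`, or the same settings can be entered in the Kibana interface.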

VM size: as I told you before, it depends on your data. The recommendation is 1 GB of RAM for every 30 GB of disk space used on hot nodes, and the ratio changes to 1 to 100 on warm nodes. HOWEVER, you must test as much as you can with data as close to real-world data as possible, and then you will see whether you need more or less.
Never set more heap for Elasticsearch than half the system RAM.
If you start with 3 nodes, use them all as all-purpose nodes (master, data, ingest, etc.).
As your cluster grows, you will be able to add dedicated hot nodes, warm nodes, master nodes, etc.
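
The RAM and heap rules above can be turned into a quick sizing check. This is only a sketch of the rules of thumb from this thread (1:30 for hot nodes, 1:100 for warm nodes, heap at most half of RAM), using the 8 GB VMs mentioned in the question as an assumed example; the function names are hypothetical.

```python
HOT_DISK_PER_RAM_GB = 30    # rule of thumb for hot nodes
WARM_DISK_PER_RAM_GB = 100  # rule of thumb for warm nodes


def disk_supported_gb(ram_gb, disk_per_ram_gb):
    """Disk usage one node can serve under a given disk-to-RAM ratio."""
    return ram_gb * disk_per_ram_gb


def max_heap_gb(ram_gb):
    """Heap should never exceed half of the system RAM."""
    return ram_gb / 2


ram = 8  # GB of RAM per VM, as in the question
print(disk_supported_gb(ram, HOT_DISK_PER_RAM_GB))   # 240
print(disk_supported_gb(ram, WARM_DISK_PER_RAM_GB))  # 800
print(max_heap_gb(ram))                              # 4.0
```

Dividing the total retained disk usage by the per-node figure gives a first guess at the number of hot nodes, which should then be validated with realistic test data.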

If you have not already, take a good look at the mappings documentation; it is one of the most important parts of Elasticsearch.