Best Cluster Concept, Timeseries Data and more

Hi,
I am currently developing some cluster concepts and am once again asking for your opinions and insight. We will need different clusters for different customers. I have some questions about my concepts and I hope you can help.

Concept 1, Simple Cluster. Should be able to handle 200MB/day, maybe more in the future. What should I do when I add new nodes? The Elastic documentation states that beyond 3 nodes I should start giving some nodes specific roles. My problem with this is that I want a reliable cluster, so when I start adding specific roles I will need at least 3 new nodes with these roles so that they are failproof. Right? So if I want nodes that are only master, I will need 3 new nodes which only do the master job. That doesn't sound too good.
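The "at least 3" figure comes from majority voting among master-eligible nodes. Here is a minimal Python sketch of that arithmetic. The function name is made up for illustration, and it only models a simple majority rule, not Elasticsearch's actual voting configuration handling. Note that it counts master-eligible nodes, which can also hold data; they do not have to be dedicated masters.

```python
def tolerated_failures(master_eligible_nodes: int) -> int:
    """How many master-eligible nodes can be lost while a majority remains."""
    quorum = master_eligible_nodes // 2 + 1  # smallest strict majority
    return master_eligible_nodes - quorum


for n in (1, 2, 3, 5):
    print(f"{n} master-eligible node(s): survives {tolerated_failures(n)} failure(s)")
```

Under this rule, going from 3 to 4 master-eligible nodes does not buy any extra failure tolerance, which is why odd counts are the usual choice.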

Concept 2, Master/Data:
This concept should be used when we have customers with more data: 20GB/day. All customers have timeseries data, so we will never change or delete any of it. Is this a good solution for this? My concern is that the master nodes would be useless and bored. I can't see why they would ever need more than 5% of their capacity, yet they would take 30% of the budget. At what point do master nodes become useful? Shard allocation and cluster management don't sound resource intensive.

Concept 3, Hot-Warm:
Is the following concept better for timeseries data? The data is still around 20GB/day. I am happier with this concept because the master nodes are also the warm nodes. I don't think the warm nodes will get a lot of traffic, so they should be fine. Or is this a bad idea?


Is the following concept really better? It would add a lot of costs. (If there is a good reason for master nodes we are happy to pay for it, but not if they are not that important)

Concept 4, Hot-Warm-Cold-Master:
This is for the biggest project: 200GB/day or more. I think coordinating nodes are missing in the following concept. They sound really important when having such a big cluster and so much data. What is your opinion on coordinating nodes? What configuration should they have?

To sum up some questions:

  • How important are master nodes, at what point should we use them?
  • What is a good cluster concept for timeseries data?
  • Is there anything "stupid" in my concepts? Open for improvements.
  • What do your clusters look like? For comparison.

Thank you for your time,
defalt

When sizing Elasticsearch you will need to take retention period into account. The total data size will drive the number of nodes required in the cluster and determine the optimal cluster structure.
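As a rough illustration of how volume and retention combine, here is a minimal Python sketch. The daily volumes are the ones from the question. The retention periods, the single replica, and the assumption that on-disk size roughly equals raw size are illustrative guesses, not measurements, so test with real data before trusting the numbers.

```python
def total_storage_gb(daily_gb: float, retention_days: int, replicas: int = 1) -> float:
    """Approximate on-disk size: daily volume x retention x (primary + replicas)."""
    return daily_gb * retention_days * (1 + replicas)


# Daily volumes from the three concepts; retention values are assumptions.
daily_volumes_gb = {"200MB/day": 0.2, "20GB/day": 20, "200GB/day": 200}

for label, daily_gb in daily_volumes_gb.items():
    for retention_days in (30, 150):
        size = total_storage_gb(daily_gb, retention_days)
        print(f"{label}, {retention_days} days retention: ~{size:,.0f} GB")
```

The same daily volume can call for very different cluster layouts depending on how long the data is kept.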

Yes, I know that, which is why I included the data size in all my questions. We are really flexible on the retention period, and I think we will only need ILM for the bigger clusters. We will scale the hot nodes according to the amount of data, and they will keep the data for 1-5 months, I guess. What do you think of the concepts I showed above, given these data sizes?

  • 200MB/day.
  • 20GB/day.
  • 200GB/day.

Thanks

20GB a day is not a lot of data so I would expect a simple 3-node cluster where all nodes hold data and are master eligible to handle that. There should be no need for dedicated master nodes unless you need to add nodes to get more storage.

For larger volumes it often depends on the relation between indexed volume and retention, e.g. how fast your data turn over.

OK, so for the biggest cluster of 200GB/day we would have a retention of 20-50 days on the hot nodes. After that the data should be moved to warm nodes. So we would have 3-4 hot nodes with 2TB SSD each and more warm nodes in the back. Is this a good idea, or should we just use a Master/Data structure for this?
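A quick back-of-the-envelope check of those numbers, as a Python sketch. The 200GB/day, 20-50 days, and 2TB per node come from the post above. The replica count and the 85% usable disk fraction (chosen to stay below Elasticsearch's default low disk watermark) are planning assumptions, and the function name is hypothetical.

```python
import math


def hot_nodes_needed(daily_gb: float, retention_days: int, node_disk_gb: float,
                     replicas: int = 1, usable_fraction: float = 0.85) -> int:
    """Estimate the hot node count from storage alone (ignores CPU and heap)."""
    required_gb = daily_gb * retention_days * (1 + replicas)
    usable_per_node_gb = node_disk_gb * usable_fraction
    return math.ceil(required_gb / usable_per_node_gb)


for retention_days in (20, 50):
    for replicas in (0, 1):
        nodes = hot_nodes_needed(200, retention_days, 2000, replicas)
        print(f"{retention_days} days, {replicas} replica(s): {nodes} hot nodes")
```

Under these assumptions, 3-4 hot nodes only cover the 20-day case without replicas, so the hot retention, the disk size, or the node count would probably need adjusting.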

Let's say after 2 years we need more storage. Would we need to buy 3 new master-eligible nodes (to be failproof) plus new nodes for storage? Say we have 3 nodes and they are full: what would be the next step?
