Limit on # of data nodes?

Is there a limit on the number of data nodes or overall amount of data we can have in a cluster? How can we tell when we're nearing that limit?

We've been told a few times by Elastic employees that our limit is probably around 100 data nodes, but we've seen reports of other people maintaining single clusters much larger than that. What's a good indicator that it's time to split our data into two clusters? We'd like to know well before we actually hit the limit.

There is no built-in fixed limit, and the practical maximum probably depends on the use case. The larger the cluster gets and the more data it holds, the larger the cluster state gets. At some point this will start to cause problems and make the cluster less stable. I have also seen clusters larger than 100 nodes, but now that there is cross-cluster search, 100 may be a good size to aim for.
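One rough way to keep an eye on this (a sketch, assuming the usual REST endpoint is reachable) is to track the serialized size of the cluster state over time, along with the node and shard counts that dominate it:

```
# Fetch the full cluster state; watching its serialized size grow over
# time is a warning sign (it is dominated by index metadata and routing).
GET _cluster/state

# Node and shard counts are a useful proxy for cluster state size:
GET _cluster/health?filter_path=number_of_nodes,active_shards
```

If those numbers climb steadily month over month, that's a hint to start planning a split before anything actually breaks.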

As Christian said, there's no technical limit.

We're finding that simply growing a cluster with no end in sight becomes a management burden. Features like cross-cluster search let you split data into easier-to-manage chunks, just as you split data sources into separate, aligned indices.

There's no good indicator though. You might find that you can get to 100 nodes and your tooling and management allow you to go further.

We've looked into it a bit, but it doesn't look like we'd have much to gain from cross-cluster search versus two separate clusters. We already use a separate index per customer, and it seems like either way we'd have to track which customer lives on which cluster/index.
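For what it's worth, the one thing cross-cluster search would buy us is a single query endpoint: you register the second cluster as a remote and can then address its indices with a `cluster:index` prefix. A sketch, with hypothetical cluster and customer names:

```
PUT _cluster/settings
{
  "persistent": {
    "cluster.remote.cluster_two.seeds": ["cluster-two-node:9300"]
  }
}

# One request can then span a local index and a remote one:
GET customer-acme,cluster_two:customer-globex/_search
{
  "query": { "match_all": {} }
}
```

But since we already map customers to indices ourselves, adding a cluster prefix to that mapping isn't much extra work either way.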

When you say that our tooling/management may not allow us to go beyond 100 nodes, do you think we'd see high cpu/memory on the data or client nodes perhaps before this happens? Or can this just suddenly happen with no warning at all?

I mean more around resource orchestration and automation.

I think the master is probably the ultimate limiting factor within a single cluster. Most components scale sideways quite well, but there's only ever one elected master making decisions about the cluster, and there's a limit to how large or fast you can reasonably make that node. More clusters means more masters, each doing less work, which is good for everyone.

A pattern I've encountered a few times is where the cluster grows and grows without showing any problems, and then something happens to upset it and the master struggles to keep up with coordinating all the recovery work needed to bring things back to health. Often this is exacerbated by relaxing the defaults on certain safety mechanisms (for "scaling" reasons) without really understanding the consequences if, say, a network partition were to drop 25% of your nodes for a few minutes.
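As a concrete example (a sketch; the exact settings worth guarding vary by version), two defaults people commonly relax for "scaling" reasons are recovery concurrency and delayed allocation. Raising the delay before reallocating shards from a departed node actually *protects* the master during a brief partition, so changing these deserves care:

```
# Default is 2 concurrent incoming/outgoing recoveries per node; cranking
# this up multiplies the recovery work in flight after a failure.
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.node_concurrent_recoveries": 2
  }
}

# Default is 1m; a longer delay gives briefly-dropped nodes time to
# rejoin before the master starts rebuilding all their shards elsewhere.
PUT _all/_settings
{
  "index.unassigned.node_left.delayed_timeout": "5m"
}
```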

For instance, on the master there are a few places where we iterate over all the nodes and/or all the shards, which is totally reasonable when you've 50 nodes and 30k shards and pretty unreasonable when you've 500 nodes and 300k shards. We're aware of more efficient algorithms for cases like that, but usually not without making things worse for smaller clusters, and making it harder to maintain too, so we tend to avoid that complexity and instead target a limited cluster size.

30k shards isn't small by any measure; it could easily be a couple of PB.


I think we have that part pretty well figured out; we're just having a hard time estimating when our cluster will reach its maximum recommended size. It's also hard to tell whether hot data nodes hitting high CPU for seemingly no reason could be an indicator that we need a second cluster.

We use Chef and Terraform to set up the nodes, monitor them with Telegraf/Grafana, and have been trying out Kibana/Metricbeat as well. We also follow all of the recommendations in the sizing webinar, such as keeping to a specific memory:data ratio.

Are you using Monitoring to see what's happening in the cluster?
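If you're shipping stats with Metricbeat, something like its `elasticsearch` module config (a sketch; hostnames and the exact metricset list are assumptions) feeds the Monitoring UI:

```yaml
# metricbeat.yml
metricbeat.modules:
  - module: elasticsearch
    metricsets: ["node", "node_stats", "cluster_stats", "index", "shard"]
    period: 10s
    hosts: ["http://localhost:9200"]
```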

Nice, thanks! This is exactly the info I was looking for. This makes sense too, with some of the problems we've seen on our masters.

The monitoring info Kibana shows by default doesn't seem to provide a lot of information on what the cluster is doing. I'm pretty sure these nodes are responding to a lot of searches on a heavily used index, though.

It should definitely show index and query requests. It might be worth starting another topic to dig into that (if you want) :slight_smile:

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.