Is it possible for a cluster to have hundreds (or thousands) of nodes at one time?
I heard that there is a kind of "soft limit" on the size of a cluster due to gossip performance, so that 150 nodes or thereabouts is considered the maximum. Is this true? I've looked through Elasticsearch's official documentation and a lot of blogs, but couldn't find any information about the maximum node count in one cluster.
I totally understand that this all depends on network latency, system performance and so on, but I'd like to know whether there is some "tangible" or "feasible" total node count for one cluster.
The reason I'm asking is that I'd like to store petabyte-scale data in one cluster.
There is no soft limit built into Elasticsearch, and I have seen reports of users with several hundred nodes in a cluster. What generally limits cluster size is the size of the cluster state and how fast it can be updated and propagated. The larger the cluster state is, the longer it takes to update. The more nodes you have in the cluster, the longer it may also take to propagate changes to all of them.
One thing that takes up a lot of space in the cluster state is information about where all the shards are located and what their current status is. This can also change frequently in a large cluster, and each change requires the cluster state to be updated. To maximise cluster size it is therefore generally important to minimise the shard count in the cluster.
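To get a feel for how shard count grows with data volume, here is a rough back-of-the-envelope sketch in Python. The 50 GB target shard size, the single replica, and the 150-node figure are illustrative assumptions only, not limits or recommendations from Elasticsearch, and the function names are made up for this example.

```python
# Rough shard-count estimate. All inputs are illustrative assumptions.

GB = 10**9
PB = 10**15


def estimate_total_shards(total_bytes, target_shard_bytes, replicas=1):
    """Return primaries plus replica copies for the given data volume."""
    # Integer ceiling division avoids floating-point rounding.
    primaries = -(-total_bytes // target_shard_bytes)
    return primaries * (1 + replicas)


def shards_per_node(total_shards, data_nodes):
    """Average number of shard copies each data node would hold."""
    return total_shards / data_nodes


# Assumed scenario: 1 PB of primary data, 50 GB shards, 1 replica, 150 nodes.
total_shards = estimate_total_shards(1 * PB, 50 * GB, replicas=1)
per_node = shards_per_node(total_shards, 150)

print(f"total shard copies: {total_shards}")
print(f"average per node:   {per_node:.1f}")
```

Halving the assumed shard size doubles the number of shards the cluster state has to track, which is why larger shards tend to help when the goal is a big cluster.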
Most users who do have a lot of Elasticsearch nodes have identified a cluster size that works for them and are running multiple clusters. This can be higher or lower than the 150 data nodes quoted, but 150 sounds like a reasonable starting point.
Given that cross-cluster search is now available, and scales considerably better than the old tribe nodes, having multiple clusters and querying across them should not be a problem.
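As a small illustration, cross-cluster search addresses remote indices as `cluster_alias:index_name` in the search path. The sketch below only builds that path; the cluster aliases `cluster_one` and `cluster_two` and the `logs-*` pattern are hypothetical, and it assumes the remote clusters have already been registered under those aliases.

```python
# Build a cross-cluster search path. Aliases and index pattern are hypothetical.

def ccs_index_expression(cluster_aliases, index_pattern):
    """Join 'alias:pattern' entries into one comma-separated expression."""
    return ",".join(f"{alias}:{index_pattern}" for alias in cluster_aliases)


expression = ccs_index_expression(["cluster_one", "cluster_two"], "logs-*")
search_path = f"/{expression}/_search"

print(search_path)
```

The resulting path can then be sent as an ordinary search request to the cluster where the remote clusters are configured.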
Yeah, cross-cluster search sounds great. Can I ask what you think about Elasticsearch-Hadoop for this purpose (storing petabytes of data)? Elasticsearch-Hadoop was made for this problem, right?