What algorithm does Elasticsearch use to uniformly distribute data across different nodes in the cluster? How does it deal with new nodes / dead nodes?
Does it use disk space, consistent hashing? Any resources that you can share would be greatly appreciated. Thank you.
Data is split across shards by hashing the document ID, dividing the hash by the number of shards and taking the remainder.
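As a rough sketch of that routing rule (not Elasticsearch's actual code; Python's built-in `hash` stands in for the real hash function, and the document IDs and shard count below are made up):

```python
# Illustrative sketch only: hash the document ID, keep it to 32 bits,
# and take the remainder modulo the number of primary shards.

def shard_for(doc_id: str, number_of_primary_shards: int) -> int:
    """Pick a shard by hashing the document ID and taking the remainder."""
    doc_hash = hash(doc_id) & 0xFFFFFFFF   # limit to a 32-bit value
    return doc_hash % number_of_primary_shards

# With 5 primary shards, every document ID maps to a shard number 0..4:
for doc_id in ["user-1", "user-2", "user-3"]:
    print(doc_id, "->", "shard", shard_for(doc_id, 5))
```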
Shards are allocated to nodes taking a number of factors into account, including disk space. The reference manual has all the details.
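For the disk space factor specifically, the relevant knobs are the disk "watermark" cluster settings. A sketch of adjusting them, assuming a cluster reachable at localhost:9200 (the 85%/90% values shown are the documented defaults, so this call only makes them explicit):

```python
# Sketch: disk usage thresholds that influence shard allocation.
import requests

resp = requests.put(
    "http://localhost:9200/_cluster/settings",
    json={"persistent": {
        "cluster.routing.allocation.disk.watermark.low": "85%",
        "cluster.routing.allocation.disk.watermark.high": "90%",
    }},
)
print(resp.json())
```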
New nodes and dead nodes aren't treated specially by the algorithm. Allocation decisions are made assuming that the membership of the cluster is fixed. When an empty node joins the cluster Elasticsearch will relocate some data onto it so that each node holds roughly the same number of shards. If a node fails then the shards it held are distributed among the remaining nodes, although there's a short delay before anything happens, in case the node comes back.
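The length of that delay is configurable per index. A sketch of setting it, assuming a cluster at localhost:9200 and an index called my-index (the setting name `index.unassigned.node_left.delayed_timeout` comes from the reference manual, not from this thread):

```python
# Sketch: wait 5 minutes before reallocating shards from a departed node.
import requests

resp = requests.put(
    "http://localhost:9200/my-index/_settings",
    json={"index.unassigned.node_left.delayed_timeout": "5m"},
)
print(resp.json())
```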
Which document ID is this? The only document ID I'm familiar with in ES is the actual index document ID. Also, how can you divide the hash? It contains hexadecimal letters as well as digits.
The hash used in Elasticsearch is a 32-bit number. The hashes you are thinking of, containing digits and letters, are still numbers; they are just written in hexadecimal, which is shorter than decimal (i.e. using only digits) and maps directly onto the binary representation the computer actually works with.
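In other words, a hex-written hash can be divided like any other number. A quick illustration (the hash value here is made up):

```python
# A hash written in hexadecimal is still just a number, so the remainder
# operation works the same way.
hex_hash = "9e3779b9"               # hexadecimal form (digits and letters)
as_int = int(hex_hash, 16)          # same value in decimal: 2654435769
print(as_int % 5)                   # remainder when split across 5 shards -> 4
```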
It considers the shard movements that satisfy the constraints (disk space, allocation filtering, etc.), measures approximately how "balanced" the cluster is, and looks for the relocations that improve that balance the most.
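To give a feel for that idea (this is an illustrative toy only, not Elasticsearch's allocator; the balance measure, node names and `allowed` constraint check are all invented for the example):

```python
# Toy sketch: score how uneven the shard counts are, then greedily pick the
# single relocation (that the constraint check allows) which improves the
# score the most.

def imbalance(shard_counts):
    """Simple balance measure: spread between the fullest and emptiest node."""
    return max(shard_counts.values()) - min(shard_counts.values())

def best_relocation(shard_counts, allowed):
    """Return the (source, target) move that reduces imbalance most, if any."""
    best, best_score = None, imbalance(shard_counts)
    for src in shard_counts:
        for dst in shard_counts:
            if src == dst or not allowed(src, dst):
                continue  # skip moves that violate disk/filtering constraints
            trial = dict(shard_counts)
            trial[src] -= 1
            trial[dst] += 1
            if imbalance(trial) < best_score:
                best, best_score = (src, dst), imbalance(trial)
    return best

# A new empty node joins a cluster whose other nodes hold 6 shards each:
counts = {"node-a": 6, "node-b": 6, "node-c": 0}
print(best_relocation(counts, allowed=lambda s, d: True))  # ('node-a', 'node-c')
```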