Scaling Factors

What are the top scaling factors for growing into a large ES cluster? My
operations department has some concerns about how we'd grow an ES cluster
to support hundreds or thousands of nodes. Does an ES cluster require a
non-blocking network? They define this as a network in which all nodes are
linked with uniform throughput and latency. They see ES's rack-aware
configuration and worry a little, having run into problems with Hadoop
clusters in the past: the fast-talking intra-rack nodes were able to
saturate the inter-rack data channels, causing the overall system to hit a
wall (suddenly). They've also been burned by many hundreds of nodes
saturating a network trying to keep state synchronized.

From what I can tell the master servers ought to keep the state chatter
down.

Would someone in the know, or someone with experience with large ES
deployments, mind describing what the important scaling factors of large ES
deployments are and at roughly what thresholds I'm likely to hit them?

Thanks,
Jim

--

Hello Jim,

I wouldn't worry about the chatter between nodes during "normal"
operation. I would worry about the traffic when ES is rebalancing, for
example when you add or remove a node. In my experience, that dwarfs the
regular traffic, even if you have lots of queries or indexing.

Depending on the size of your data, you can estimate what kind of
traffic you'd have when rebalancing. So if I were you I would take the
worst-case scenario, do the math, add a buffer for my peace of mind,
then see if the network can handle it.
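
As a rough illustration of that math, here is a minimal Python sketch. Every number in it (rack size, data per node, link speed, the share of the link that rebalancing may use, the size of the buffer) is a made-up assumption rather than an ES default or a measurement, and the function name is hypothetical. It only does the bandwidth arithmetic and ignores any throttling or overhead in ES itself.

```python
def rebalance_hours(data_to_move_gb, link_gbps, usable_fraction=0.5, safety_buffer=1.5):
    """Estimate the hours needed to push data_to_move_gb across one link.

    All parameters are illustrative assumptions, not ES settings:
      data_to_move_gb  - data that must be copied, in gigabytes
      link_gbps        - raw capacity of the bottleneck link, in Gbit/s
      usable_fraction  - share of the link left over for rebalancing traffic
      safety_buffer    - multiplier added "for peace of mind"
    """
    if data_to_move_gb < 0 or link_gbps <= 0:
        raise ValueError("data must be >= 0 and link speed must be > 0")
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")

    bits_to_move = data_to_move_gb * 8e9              # GB -> bits
    usable_bits_per_sec = link_gbps * 1e9 * usable_fraction
    seconds = bits_to_move / usable_bits_per_sec
    return seconds * safety_buffer / 3600             # buffered, in hours


# Assumed worst case: a whole rack of 20 nodes holding 500 GB each drops out,
# and its data has to be copied over a single 10 Gbit/s inter-rack uplink.
nodes_lost = 20
data_per_node_gb = 500
hours = rebalance_hours(nodes_lost * data_per_node_gb, link_gbps=10)
print(f"Estimated rebalancing time: {hours:.1f} hours")
```

If the result is longer than you can tolerate, that points at the inter-rack link (or the amount of data per node) as the thing to size first.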

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

On Tue, Oct 16, 2012 at 8:29 AM, Jim Hazen jimhazen2000@gmail.com wrote: