Scaling: Cluster for speed or for size?

I'm wondering if I should use an ES cluster for my setup:

I've got 4 indexes with about 2,000,000 documents and 6 GB in total.
This fits on one server without problems.

Now my scaling question:

I could run multiple instances of the same server with identical
ES installations behind a load balancer (nginx).
If I need to scale further, I just add instances.

Or is it better to use a cluster for this? (The cluster has to communicate
internally; doesn't that slow things down?)
Does a cluster only make sense for very large indexes that don't fit on
one server?

Thank you!

Martin

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Martin,

This is only my own experience. I have about 76,000,000 documents in one
index and one document type with 10 shards. I once tried 16 shards, but
then 10 shards loaded faster and queried faster. Not scientific, but I did
use the exact same set of data for bulk load and bulk update and queries.

That said, the recommendation I remember reading was: Increase the number
of shards to improve bulk load performance, and increase the number of
nodes to improve query performance.

Also, increasing the number of nodes is the only way to gain fault
tolerance.

The minimum cluster should be 3 nodes, with the minimum number of master
nodes set to 2. Beyond that, for N nodes set the minimum to (N/2)+1,
rounded down. Disable multicast discovery, and configure the same list of
unicast hosts on every node. Then let ES do the balancing. It works
remarkably well. Of course, I upgrade ES cautiously; my 3-node cluster is
now at 0.90.0, but I've been testing with 0.90.3 and am ready to migrate
the cluster soon.
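
The (N/2)+1 rule above is plain integer arithmetic. Here is a minimal sketch in Python, assuming N counts the master-eligible nodes and that the result is what you would put into the discovery.zen.minimum_master_nodes setting of ES releases from this era. The function name is made up for illustration.

```python
def minimum_master_nodes(master_eligible_nodes: int) -> int:
    """Quorum size for N master-eligible nodes: (N / 2) + 1, rounded down."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


if __name__ == "__main__":
    for n in (1, 2, 3, 5, 7):
        print(n, "nodes ->", minimum_master_nodes(n))
```

Note that for 2 nodes the quorum is also 2, so losing either node leaves the cluster unable to elect a master. That is why 3 nodes is the practical minimum.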

For my initial bulk load, I created the index with all settings and
mappings, and (dynamically) set the number of replicas to 0. I raised the
index refresh interval to 2m (not infinity, but not my normal 2s either;
the stock ES default is 1s). After that load, I set the refresh interval
back to 2s, set the number of replicas to 2, and then used ES Head to
watch the shards replicate. Very cool.
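
The two settings changes around the bulk load could be sketched like this. It only builds the JSON bodies one would send with a PUT to the index's _settings endpoint and does not talk to a cluster. The index name and the helper function are hypothetical; the values (2m, 2s, 0 and 2 replicas) are the ones from this post.

```python
import json


def settings_body(refresh_interval: str, replicas: int) -> str:
    """JSON body for a PUT to /<index>/_settings (update index settings)."""
    return json.dumps(
        {"index": {"refresh_interval": refresh_interval,
                   "number_of_replicas": replicas}},
        sort_keys=True,
    )


# Before the load: refresh rarely, and keep no replicas to feed.
BEFORE_BULK_LOAD = settings_body("2m", 0)
# After the load: normal refresh interval again, then let ES replicate.
AFTER_BULK_LOAD = settings_body("2s", 2)

if __name__ == "__main__":
    index = "myindex"  # hypothetical index name
    for body in (BEFORE_BULK_LOAD, AFTER_BULK_LOAD):
        print("PUT /%s/_settings" % index, body)
```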

Then I apply my set of a few million "daily updates" in bulk, again
raising the index refresh interval to 2m during the updates and setting it
back to 2s when they're done. But for normal trickle updates (these updates
don't come in clumps of several million at one time like they do in my test
setup), I will try leaving the refresh interval alone and see what happens.

An nginx server can front the ES HTTP interface for the "outside world"
(outside the data center cluster, not the real external Internet, of
course) to prevent an errant developer from issuing one curl command that
instantly deletes an entire index and all of the data it contains. And
if you implement a server on top of ES that incorporates your business
logic, nginx is a good way to balance the load across those servers (which
in turn let ES balance the load across the ES cluster, which is likely the
real load). That setup also protects ES from spurious delete requests and
ensures that all access goes through your business logic and not directly
to ES.
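
To make the "no stray deletes" idea concrete, here is a small sketch of the kind of rule such a proxy might enforce, written in Python rather than as nginx configuration (in nginx itself, the limit_except directive is one way to restrict methods). The policy below is an assumption for illustration: reads pass, POST passes only for _search, everything else is refused.

```python
READ_ONLY_METHODS = {"GET", "HEAD"}


def is_allowed(method: str, path: str) -> bool:
    """Return True if the proxy should forward this request to ES."""
    method = method.upper()
    if method in READ_ONLY_METHODS:
        return True
    # Searches are often sent as POST with a JSON body, so allow those too.
    last_segment = path.split("?", 1)[0].rstrip("/").rsplit("/", 1)[-1]
    return method == "POST" and last_segment == "_search"


if __name__ == "__main__":
    # Hypothetical index name, just to show the decision per request.
    for method, path in [("GET", "/myindex/_search?q=test"),
                         ("POST", "/myindex/_search"),
                         ("DELETE", "/myindex")]:
        print(method, path, "->", "forward" if is_allowed(method, path) else "reject")
```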

By the way, the du command shows that my index of 76,000,000 documents is
consuming 16 GB of data on my laptop. Very tiny. And the initial bulk load
took about 2.3 hours, which is breathtaking considering that the same slow
laptop disk was both reading the input data and used by ES to write and
rebalance the database itself. I haven't yet timed the bulk load across the
network; ES is so fast that even this worst case is awesome to me.

Just some thoughts based on my own experience.

Brian

(In reply to Martin's message of Tuesday, August 27, 2013 5:00:38 AM UTC-4.)

Hi Brian,

Thanks a lot for your help with an initial setup. I will use it to test a
cluster environment.

What I meant by "multiple instances of the same server" is:

  • Take the non-cluster ES installation (like the one on your laptop) and
    deploy it to multiple servers (not connected as a node cluster)
  • Then use nginx as a load balancer to send requests in a round-robin
    manner to each of them

Did you try this? It would be fault tolerant as well, because nginx
could stop sending requests if a server fails. This could also be used with
only two servers.

Thanks again!
Martin


Martin,

No, I don't do this.

For a read-only cluster (no writes) it would work, I suppose.

But my 3-node cluster is set up that way to let ES handle updates and
replication with load-balancing. ES does a much better job of it, and at a
much more efficient level.

And my ES client code is 100% Java and uses a TransportClient singleton
to access the node or nodes of the cluster. When accessing my 3-node
(unicast-enabled) cluster, I add the IP addresses of the 3 nodes to the
TransportClient. Then all update and query requests get automatic
failover. Works really nicely!
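
The behaviour described here (a client that knows several node addresses and fails over between them) can be sketched in a few lines of Python. This is not the TransportClient API, just an illustration of the idea; the class, the send callable, and the addresses are all hypothetical.

```python
from itertools import cycle
from typing import Callable, Iterable


class FailoverClient:
    """Round-robins requests over a fixed node list, skipping nodes that fail."""

    def __init__(self, nodes: Iterable[str], send: Callable[[str, str], str]):
        self._nodes = list(nodes)
        if not self._nodes:
            raise ValueError("at least one node address is required")
        self._send = send
        self._next = cycle(self._nodes)

    def request(self, payload: str) -> str:
        last_error = None
        # Try each node at most once per request, continuing the rotation.
        for _ in range(len(self._nodes)):
            node = next(self._next)
            try:
                return self._send(node, payload)
            except ConnectionError as exc:
                last_error = exc
        raise ConnectionError("no node reachable") from last_error


if __name__ == "__main__":
    down = {"10.0.0.2"}  # hypothetical addresses; pretend this node is offline

    def fake_send(node: str, payload: str) -> str:
        if node in down:
            raise ConnectionError(node + " is down")
        return node + " handled " + payload

    client = FailoverClient(["10.0.0.1", "10.0.0.2", "10.0.0.3"], fake_send)
    for i in range(4):
        print(client.request("query-%d" % i))
```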

I would only consider nginx to handle fail-over and load-balancing of the
Netty-based HTTP server that incorporates my business logic (and therefore
is the only production piece that accesses ES directly). But I would never
consider splitting a cluster into independent single-node clusters, as my
applications involve a heavy update load in addition to queries.
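
To illustrate why independent single-node installations behind a round-robin balancer only suit read-only data, here is a toy simulation in Python. The classes are invented for this sketch and have nothing to do with ES internals; the point is only that each write lands on exactly one instance, so the instances drift apart.

```python
class StandaloneNode:
    """An independent single-node installation with its own private index."""

    def __init__(self, name: str):
        self.name = name
        self.docs = {}

    def index(self, doc_id: str, body: str) -> None:
        self.docs[doc_id] = body

    def get(self, doc_id: str):
        return self.docs.get(doc_id)


def round_robin_writes(nodes, docs):
    """Send each write to the next node in turn, as a plain load balancer would."""
    for i, (doc_id, body) in enumerate(docs):
        nodes[i % len(nodes)].index(doc_id, body)


if __name__ == "__main__":
    nodes = [StandaloneNode("a"), StandaloneNode("b")]
    round_robin_writes(nodes, [("1", "first"), ("2", "second"), ("3", "third")])
    for node in nodes:
        print(node.name, sorted(node.docs))
```

A read for document "2" that the balancer happens to send to the first instance finds nothing. In a real cluster, ES replicates each write to the replica shards itself.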

I hope this helps.

Brian

(In reply to Martin's message of Wednesday, August 28, 2013 10:44:02 AM UTC-4.)

I'm having a hard time thinking of a reason why you would want to run
several single-node clusters. It would be much more efficient to allow the
nodes to cluster together and share load across all your indices/queries.
You can still contain indices on a single machine using shard allocation
awareness or filtering if you want, although fault tolerance is achieved by
allowing replicas to exist on other machines.

The cluster does communicate with itself, but the overhead is very light.
It is an efficient binary protocol and not very chatty. Nodes only
communicate when they need something from each other, such as routing
search requests.

Lastly, nodes automatically round-robin requests amongst the cluster, which
eliminates the need for an external load balancer. Any node can reroute
requests to the correct node(s), so you can send your request to any node
in the cluster and let ES do the round-robining. Nginx is still useful for
outside proxying, like Brian said.
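
As a rough picture of how "any node can reroute to the correct node" works for single-document operations: the shard is derived from a hash of the document ID modulo the number of primary shards, and the receiving node forwards to whichever node holds that shard. The Python sketch below assumes a stand-in hash (zlib.crc32, not the function ES actually uses) and a made-up shard-to-node table.

```python
import zlib


def shard_for(doc_id: str, number_of_shards: int) -> int:
    """Pick the shard for a document: hash(id) modulo the shard count."""
    # zlib.crc32 is a stand-in; ES uses its own hash function internally.
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


def route(doc_id: str, shard_to_node: dict) -> str:
    """What any receiving node does: find the shard, forward to its owner."""
    return shard_to_node[shard_for(doc_id, len(shard_to_node))]


if __name__ == "__main__":
    # Hypothetical layout: 4 primary shards spread over 3 nodes.
    layout = {0: "node-1", 1: "node-2", 2: "node-3", 3: "node-1"}
    for doc_id in ("alpha", "beta", "gamma"):
        print(doc_id, "-> shard", shard_for(doc_id, len(layout)),
              "on", route(doc_id, layout))
```

Because every node computes the same answer, it does not matter which node the client contacts first.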

-Zach

PS: the "laptop" version of ES is fully capable of clustering... it's the
same as the "cluster" version of ES ;)


OK Zach, I'm convinced. I'll go for a cluster.

Martin
