Martin,
This is only my own experience. I have about 76,000,000 documents in one
index, with one document type and 10 shards. I once tried 16 shards, but
10 shards both loaded and queried faster. Not scientific, but I did
use the exact same set of data for bulk load and bulk update and queries.
That said, the recommendation I remember reading was: Increase the number
of shards to improve bulk load performance, and increase the number of
nodes to improve query performance.
Also, increasing the number of nodes is the only way to gain fault
tolerance.
The minimum cluster should be 3 nodes with a minimum of 2 masters. Beyond
that, N nodes with a minimum of (N/2)+1 masters. Disable multicast
discovery, and configure the same list of unicast hosts for each node's ES
instance. Then let ES do the balancing. It works remarkably well. Of
course, I cautiously upgrade ES; my 3-node cluster is now at 0.90.0 but
I've been testing with 0.90.3 and am ready to migrate the cluster soon.
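A minimal sketch of the (N/2)+1 rule above, assuming the division is integer division and that the result is the value you would give to the discovery.zen.minimum_master_nodes setting. The function name is made up for this illustration:

```python
def minimum_masters(master_eligible_nodes: int) -> int:
    """Quorum size for N master-eligible nodes: (N / 2) + 1, integer division."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one node")
    return master_eligible_nodes // 2 + 1


# Print the quorum for a few cluster sizes, starting with the 3-node minimum.
for n in (3, 4, 5, 7):
    print(n, "nodes ->", minimum_masters(n), "masters")
```

For the 3-node cluster this gives 2, which matches the minimum stated above.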
For my initial bulk load, I created the index with all settings and
mappings, and (dynamically) configured the number of replicas to 0. I
raised the index refresh interval to 2m (not infinity, but far longer than
the 1s default). After that load, I set the index refresh interval back to
2s, set the number of replicas to 2, and then used ES Head to watch the
shards replicate. Very cool.
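The two settings changes described above could be sketched roughly as follows. This only builds the requests for the index settings update endpoint and does not send them. The host "localhost:9200", the index name "myindex", and the function names are placeholders invented for this example, not taken from the setup described here:

```python
import json


def bulk_load_settings() -> dict:
    """Settings applied before the initial load: no replicas, slow refresh."""
    return {"index": {"number_of_replicas": 0, "refresh_interval": "2m"}}


def post_load_settings() -> dict:
    """Settings applied after the load: 2 replicas, refresh every 2s."""
    return {"index": {"number_of_replicas": 2, "refresh_interval": "2s"}}


def settings_request(host: str, index: str, settings: dict) -> tuple:
    """Return (method, url, body) for an index settings update."""
    return ("PUT", f"http://{host}/{index}/_settings", json.dumps(settings))


# Placeholder host and index name; substitute your own.
for step in (bulk_load_settings(), post_load_settings()):
    method, url, body = settings_request("localhost:9200", "myindex", step)
    print(method, url, body)
```

Each printed line can be sent with curl or any HTTP client once the bulk load reaches the matching stage.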
Then I apply my set of a few million "daily updates" in bulk, again
raising the index refresh interval to 2m during the updates and then back
to 2s when they're done. But for normal trickle updates (these updates
don't come in clumps of several million at one time like they do in my test
setup), I will try leaving the refresh interval alone and see what happens.
An nginx server can front the ES HTTP interface to the "outside
world" (outside the data center cluster, not the real external Internet,
of course) to prevent an errant developer from issuing one curl command
that instantly deletes an entire index and all of the data it contains.
And if you implement a server on top of ES that incorporates your business
logic, nginx is a good way to balance the load across your servers (which
in turn let ES balance the load across the ES cluster, which is likely the
real load). It also ensures that all access goes through your business
logic and not directly to ES.
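The rule such a front end would enforce can be sketched in a few lines. This is not nginx configuration, only an illustration in Python of the policy itself. The set of allowed methods and the function name are assumptions made for this example:

```python
# Methods the front end would pass through to ES (an assumed policy).
# DELETE is deliberately absent, so a stray "curl -XDELETE" is refused.
ALLOWED_METHODS = {"GET", "HEAD", "POST", "PUT"}


def is_allowed(method: str) -> bool:
    """Return True if a request with this HTTP method may reach ES."""
    return method.upper() in ALLOWED_METHODS


for m in ("GET", "PUT", "DELETE"):
    print(m, "allowed" if is_allowed(m) else "blocked")
```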
By the way, the du command shows that my index of 76,000,000 documents is
consuming 16 GB of disk on my laptop. Very tiny. And the initial bulk load
took about 2.3 hours, which is breathtaking considering that the same slow
laptop disk was both reading the input data and used by ES to write and
rebalance the database itself. I haven't yet timed the bulk load across the
network; ES is so fast that even this worst case is awesome to me.
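To put those numbers in perspective, here is the back-of-the-envelope arithmetic, assuming "16 GB" means decimal gigabytes (du may well have reported binary units, which would raise the per-document figure slightly):

```python
DOCS = 76_000_000
LOAD_HOURS = 2.3
INDEX_BYTES = 16 * 10**9  # assumption: decimal gigabytes
SHARDS = 10

docs_per_second = DOCS / (LOAD_HOURS * 3600)  # average bulk load rate
bytes_per_doc = INDEX_BYTES / DOCS            # average on-disk size per document
docs_per_shard = DOCS // SHARDS               # assuming an even spread over shards

print(f"{docs_per_second:,.0f} docs/s")
print(f"{bytes_per_doc:.0f} bytes/doc")
print(f"{docs_per_shard:,} docs/shard")
```

That works out to roughly 9,200 documents per second, a bit over 200 bytes per document, and 7.6 million documents per shard.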
Just some thoughts based on my own experience.
Brian
On Tuesday, August 27, 2013 5:00:38 AM UTC-4, Martin wrote:
I'm wondering if I should use an ES cluster for my setup:
I've got 4 indexes with about 2.000.000 documents and 6 GB size
altogether. This fits without problems on one server.
Now my scaling-question:
I could use multiple instances of the same server with identical
ES-installations behind a loadbalancer (nginx)
If I need to scale more I just add instances.
Or is it better to use a cluster for this (the cluster has to communicate
internally - doesn't that slow things down?)
Does a cluster only make sense with very large indexes which don't fit on
one server?
Thank you !
Martin