On Tue, 2010-05-25 at 12:35 -0700, John Chang wrote:
I ran load tests to find out how fast I could index documents. I am using
the java client with Elasticsearch 0.7.0, and I found that adding more data
nodes didn't really speed things up. Having only one data node was faster;
I guess because it did not have to do any routing or balancing, but it is
not sustainable for my needs. Having 2 and 4 data nodes gave me about the
same transactions per second. Is this expected? Any advice on speeding up
indexing throughput? Thanks.
I've seen the same thing, and kimchy has just explained to me on IRC how
things should work, so I thought I'd share it here for those who, like
me, are new to distributed workloads.
There are two concerns being addressed in a cluster:
- performance/scalability
- high availability/redundancy
There are three variables which we can change to affect the above:
- index.number_of_shards (default 5)
- index.number_of_replicas (default 2)
- the number of nodes we start
index.number_of_shards is intended to improve performance - the more
shards (within reason) the more we can divide up the work between nodes.
index.number_of_replicas is intended to improve redundancy AND search
performance.
- The more replicas, the better the chance that our cluster will keep
going when a node goes down. The number of replicas is in addition to
the primary shard, so with the default of 2, we have the primary plus 2
replicas (a total of 3 copies of each shard).
- The more replicas, the more nodes that search requests can be
distributed across.
Having more replicas comes at a cost. A newly indexed document needs to
be added to the primary shard, then copied to each replica.
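
To put numbers on that cost, here is a small Python sketch. The function
names are made up for illustration and nothing here calls Elasticsearch.
It assumes every copy of a shard is actually allocated and that each copy
lives on a different node.

```python
# Illustrative arithmetic only; these helpers are hypothetical and do not
# talk to Elasticsearch.

def copies_per_shard(number_of_replicas):
    """The primary plus its replicas."""
    return 1 + number_of_replicas


def writes_per_document(number_of_replicas):
    """A document lands in one shard and is written to every copy of it."""
    return copies_per_shard(number_of_replicas)


def copies_that_can_be_lost(number_of_replicas):
    """Copies of a shard that can disappear while one copy still survives."""
    return copies_per_shard(number_of_replicas) - 1


for replicas in (0, 1, 2):
    print(
        "replicas=%d: %d copies per shard, %d writes per document, "
        "survives losing %d copies"
        % (
            replicas,
            copies_per_shard(replicas),
            writes_per_document(replicas),
            copies_that_can_be_lost(replicas),
        )
    )
```

So each extra replica buys one more copy that can be lost, at the price of
one more write for every indexed document.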
So for example:
- we start 3 nodes
- we create an index with 5 shards and 2 replicas.
- this gives us a total of 15 shards (5 primaries and 10 replicas).
- searching across this cluster is faster than searching against one
node, as the data is spread across 3 nodes
- indexing a new doc in this cluster will be slightly slower than
indexing to a single node, because of the replication costs
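
The layout in this example can be sketched in a few lines of Python. The
round-robin placement below is an assumption made for illustration, not
Elasticsearch's real allocator, and place_shards is a hypothetical helper.
The only rule it borrows is that two copies of the same shard never share
a node.

```python
def place_shards(nodes, number_of_shards, number_of_replicas):
    """Return {node: [(shard, copy), ...]} using a simple round-robin.

    copy 0 stands for the primary, copies 1..n for the replicas.
    Assumed placement for illustration only.
    """
    copies = 1 + number_of_replicas
    if copies > nodes:
        raise ValueError("not enough nodes to give every copy its own node")
    layout = {node: [] for node in range(nodes)}
    for shard in range(number_of_shards):
        for copy in range(copies):
            # Consecutive copies of a shard go to consecutive nodes, so
            # they never land on the same node while copies <= nodes.
            layout[(shard + copy) % nodes].append((shard, copy))
    return layout


layout = place_shards(nodes=3, number_of_shards=5, number_of_replicas=2)
total = sum(len(held) for held in layout.values())
print("total shard copies:", total)
for node, held in sorted(layout.items()):
    primaries = sum(1 for _, copy in held if copy == 0)
    print("node %d holds %d copies (%d primaries)" % (node, len(held), primaries))
```

With 3 nodes and 3 copies of every shard, each node ends up holding a copy
of all 5 shards, which is why every node takes part in every write.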
Now we add a fourth node.
- we still have 15 shards, but now they are spread across 4 machines
- searching speed improves because the data is spread across more
machines
- indexing speed improves because each node now holds fewer shards and
so does less of the work than before
So to see a similar indexing improvement, you could instead decrease the
number of replicas to 1, but then you're potentially increasing the risk
of losing data or availability when a node goes down.
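
A rough way to compare these setups is to count how many shard writes each
node performs, on average, per indexed document. The Python sketch below is
a back-of-the-envelope model, not a benchmark. It assumes documents and
shards are spread evenly, and that copies of the same shard never share a
node, so at most one copy per node can be allocated. The function name is
hypothetical.

```python
def writes_per_node_per_document(nodes, number_of_replicas):
    """Average number of shard writes one node does for one indexed document."""
    # A replica with no spare node to live on stays unallocated
    # (assumption: one copy of a given shard per node at most).
    allocated_copies = min(1 + number_of_replicas, nodes)
    return allocated_copies / nodes


for nodes, replicas in [(1, 2), (2, 2), (3, 2), (4, 2), (4, 1)]:
    load = writes_per_node_per_document(nodes, replicas)
    print("%d node(s), %d replicas: %.2f writes per node per document"
          % (nodes, replicas, load))
```

Under these assumptions, going from 1 to 3 nodes with 2 replicas leaves the
per-node write load unchanged, because each new node is used up by a
replica. Only the fourth node, or dropping to 1 replica, reduces it.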
This is as I understand it - please correct me if I have something
wrong.
clint