Query response time does NOT improve by adding additional nodes

For my thesis I am going to test the scalability/performance of the
Elasticsearch cluster depending on the number of nodes.

I have a physical server (with 8 cores, 30 GB memory) and on that physical
server I create 4 virtual machines, each assigned 2 cores and 4 GB of
memory.
On each virtual machine I install Elasticsearch, therefore each machine
represents a separate node in my Elasticsearch cluster.
The data contains about 7 million documents, and I have 5 shards and
replication_count=1 (default configuration).
I am executing a set of tests on the cluster with different
cluster-configurations. At first I assign only 1 node to the cluster and
execute the tests, then I assign 2 nodes to the cluster and execute the
same tests again, then 3 nodes etc.
The tests do not contain any fancy features (no facets, highlighting, etc),
but only filters and queries.

I expected that for each additional node in the cluster I would get a
linear improvement of the query response time. Because each node contains a
bit of the data (shard) and so the query could be executed on each shard at
the same time.
But it turns out that the response time of the query neither improves nor
gets worse; it stays almost the same.

So my question:
Are these results expected, that the query response time does NOT improve
when adding an additional node?

I execute the test queries one after another and not in parallel. Is
Elasticsearch made just to handle a high load of requests, but not for
improving the query time of a single request when I add additional nodes to
the cluster?
I mean, would I only see an improvement with each additional node if I
ran the tests in parallel, because the query load would then be
distributed?

Regards,
Herbert

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/5d210ab6-08c0-4512-a6ee-fff677ae7207%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

For better results

  • you should align the number of nodes n with the shard count c, so that
    c/n = i where i is an integer
  • 7 million docs may be too few to see any effect; this depends on the node
    capacity
  • for higher performance, you should use one JVM per machine, not 4 virtual
    machines on one machine
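
The first point can be illustrated with a small sketch. It assumes that the 5 shards are spread over the nodes as evenly as possible and it ignores replicas; the function names are made up for this example and are not part of any Elasticsearch API.

```python
def shards_per_node(shard_count: int, node_count: int) -> list[int]:
    """Spread shard_count shards over node_count nodes as evenly as possible."""
    base, extra = divmod(shard_count, node_count)
    # The first `extra` nodes carry one shard more than the rest.
    return [base + 1 if i < extra else base for i in range(node_count)]


def is_aligned(shard_count: int, node_count: int) -> bool:
    """True when c/n is an integer, i.e. every node holds the same number of shards."""
    return shard_count % node_count == 0


if __name__ == "__main__":
    # 5 shards as in the test setup, on 1 to 4 nodes.
    for n in range(1, 5):
        layout = shards_per_node(5, n)
        label = "aligned" if is_aligned(5, n) else "uneven"
        print(f"{n} node(s): {layout} ({label})")
```

With 5 shards, only the 1-node case (and a 5-node case) is aligned; on 2, 3 or 4 nodes at least one node holds more shards than the others.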

The query response time will not get faster by adding nodes. The query
response time on a single shard is limited by hardware factors (like CPU
power, RAM, disk speed) and some software settings (ES settings, ES
caches).

The query response time will scale over nodes, that is, once you have
saturated one node, you can add more nodes and continue adding docs, and
the query response time will not increase with the number of added docs.
The reason is that ES submits the queries to the shards on the nodes in
parallel, so it makes no difference whether there are one or thousands of
nodes in the cluster.

Of course you should run query tests in parallel, if you want to measure
the cluster capacity / throughput.
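
A parallel test could be structured roughly like the sketch below. The run_query function is a hypothetical placeholder that only sleeps; in a real test it would send one search request to the cluster. The query count and concurrency levels are arbitrary values chosen for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_query(query_id: int) -> float:
    """Hypothetical stand-in for one search request; returns its latency in seconds."""
    start = time.perf_counter()
    time.sleep(0.01)  # placeholder for the real request to the cluster
    return time.perf_counter() - start


def measure(total_queries: int, concurrency: int) -> tuple[float, float]:
    """Return (queries per second, mean latency in seconds) at the given concurrency."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(run_query, range(total_queries)))
    elapsed = time.perf_counter() - start
    return total_queries / elapsed, sum(latencies) / len(latencies)


if __name__ == "__main__":
    # Repeat the same workload at increasing concurrency levels.
    for concurrency in (1, 4, 16):
        qps, mean_latency = measure(64, concurrency)
        print(f"concurrency={concurrency}: {qps:.0f} queries/s, "
              f"mean latency {mean_latency * 1000:.1f} ms")
```

Running the same measurement for each cluster size shows throughput and single-query latency side by side, which separates the two effects discussed in this thread.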

Jörg

On Mon, Mar 24, 2014 at 9:54 AM, Herbert Bodner herbert.bodner@gmail.com wrote:

Are these results expected, that the query response time does NOT
improve when adding an additional node?

Thanks, Jörg, for your response.
I guess the bottleneck is the I/O (as you suggested).
So it does not matter if I add more memory or CPU power to the
Elasticsearch cluster (by adding more virtual machines), because
all nodes run on the same physical server with limited I/O capacity. If one
node already uses the whole I/O capacity, then adding more nodes (on
the same physical server) does not help.

cheers
