For my thesis I am going to test the scalability/performance of the
Elasticsearch cluster depending on the number of nodes.
I have a physical server (8 cores, 30 GB memory), and on that physical
server I create 4 virtual machines, each assigned 2 cores and 4 GB of
memory.
On each virtual machine I install Elasticsearch, therefore each machine
represents a separate node in my Elasticsearch cluster.
The data contains about 7 million documents. I have 5 shards and
replication_count=1 (default configuration).
I am executing a set of tests on the cluster with different
cluster-configurations. At first I assign only 1 node to the cluster and
execute the tests, then I assign 2 nodes to the cluster and execute the
same tests again, then 3 nodes etc.
The tests do not contain any fancy features (no facets, highlighting, etc),
but only filters and queries.
I expected that each additional node in the cluster would give a linear
improvement in query response time, because each node holds part of the
data (a shard), so the query could be executed on all shards at the same
time.
But it turns out that the query response time neither improves nor gets
worse; it stays almost the same.
So my question:
Are these results expected, that the query response time does NOT improve
when adding an additional node?
I execute the test queries one after another and not in parallel. Is
Elasticsearch made just to handle a high load of requests, but not for
improving the query time of a single request when I add additional nodes to
the cluster?
I mean, would I only see an improvement for each additional node if I ran
the tests in parallel, because then the query load is distributed?
You should align the number of nodes n with the shard count c, so that
c/n = i, where i is an integer.
7 million docs may be too few to see any effect; this depends on the node
capacity.
For higher performance, you should use one JVM per machine, not 4 virtual
machines on one machine.
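The alignment rule in the first point can be checked with simple arithmetic. The sketch below is only an illustration: the helper names are made up, it counts primary shards only, and it assumes the shards are spread as evenly as possible over the nodes.

```python
def shards_per_node(shard_count: int, node_count: int) -> list[int]:
    """Spread shard_count shards over node_count nodes as evenly as possible.

    Assumption: an idealized, perfectly balanced layout of primary shards only.
    """
    base, extra = divmod(shard_count, node_count)
    # The first `extra` nodes carry one shard more than the others.
    return [base + 1 if i < extra else base for i in range(node_count)]


def is_aligned(shard_count: int, node_count: int) -> bool:
    """True if c/n is an integer, i.e. every node holds the same number of shards."""
    return shard_count % node_count == 0


if __name__ == "__main__":
    # 5 shards, as in the setup described in the question.
    for nodes in range(1, 5):
        print(nodes, shards_per_node(5, nodes), is_aligned(5, nodes))
```

With 5 shards, none of the 2, 3 or 4 node configurations is aligned: one node always holds more shards than another.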
The query response time will not get faster by adding nodes. The query
response time on a single shard is limited by hardware factors (like CPU
power, RAM, disk speed) and some software settings (ES settings, ES
caches).
The query response time will scale over nodes. That is, once you have
saturated one node, you can add more nodes and continue adding docs, and
the query response time will not increase with the number of added docs.
The reason is that ES submits the queries to the shards on the nodes in
parallel, so it makes no difference whether there is one node or thousands
of nodes in the cluster.
Of course you should run query tests in parallel, if you want to measure
the cluster capacity / throughput.
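A parallel test of this kind could be sketched roughly as follows. This is not an Elasticsearch client: run_query is a hypothetical callable that you would replace with whatever sends one query to the cluster, and the function and field names are assumptions made for the illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def measure_throughput(
    run_query: Callable[[str], object],
    queries: Sequence[str],
    concurrency: int,
) -> dict:
    """Run all queries with `concurrency` worker threads and report timings.

    run_query is a placeholder for the code that sends one query to the cluster.
    """
    if not queries:
        raise ValueError("need at least one query")

    def timed(query: str) -> float:
        # Latency of a single request as seen by the client.
        start = time.perf_counter()
        run_query(query)
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed, queries))
    wall = time.perf_counter() - wall_start

    return {
        "concurrency": concurrency,
        "queries": len(latencies),
        "mean_latency_s": sum(latencies) / len(latencies),
        "throughput_qps": len(latencies) / wall,
    }


if __name__ == "__main__":
    # Stand-in for a real query: just sleeps for 10 ms.
    def fake_query(_query: str) -> None:
        time.sleep(0.01)

    for workers in (1, 2, 4, 8):
        print(measure_throughput(fake_query, ["q"] * 40, workers))
```

Comparing throughput_qps across cluster sizes at the same concurrency level, rather than the latency of queries sent one after another, is what would show the effect of added nodes.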
Thanks, Jörg, for your response.
I guess the bottleneck is the I/O (as you suggested).
So it does not matter if I add additional memory or CPU power to the
Elasticsearch cluster (by adding additional virtual machines), because
all nodes run on the same physical server with limited I/O capacity. If one
node already uses the whole I/O capacity, then adding additional nodes (on
the same physical server) does not help.