Query time increases as replicas and servers are added

I am seeing unexpected Elasticsearch behavior: as I increase the number of
replicas and servers, query time actually increases, while I would expect
it to decrease.

I have a cluster with 8 million complex documents (about 300 million
nested documents overall). I run complex queries, where each query computes
dozens of terms facets (top 100 values) on non-analyzed fields with
hundreds of thousands of unique field values. I have a controlled
load-testing environment in which I fire a number of queries simultaneously
and measure the average query time. Since queries were not very fast
(5-10 seconds per query on average), I started adding servers and replicas.
However, not only did this not improve query time, it made things worse,
as you can see from the attached graph.
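
To give an idea of the shape of these queries, here is a simplified sketch
in Python (index name, field names, and the facet count are placeholders,
not the real ones):

```python
import json
import time

import requests  # any HTTP client works; the cluster listens on port 9200

# Placeholder sketch of one facet-heavy query: dozens of terms facets,
# each asking for the top 100 values of a non-analyzed field.
query = {
    "query": {"match_all": {}},
    "facets": {
        "facet_%d" % i: {"terms": {"field": "attr_%d" % i, "size": 100}}
        for i in range(24)  # "dozens" of facets; 24 is made up
    },
}

start = time.time()
resp = requests.post("http://localhost:9200/myindex/_search",
                     data=json.dumps(query))
print("wall time %.2fs, ES took %d ms"
      % (time.time() - start, resp.json()["took"]))
```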

Any ideas of why this is happening?

--

Hello Zmicier,

Maybe the OS cache or Elasticsearch's own caches play a role here. How
many queries did you run for each test to get the average?
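
If caches are involved, the first runs will be much slower than later ones,
which skews the average. A purely illustrative sketch of separating warm-up
runs from measured runs:

```python
import time

def timed(run_query, warmup=5, measured=50):
    # Run the same query a few times first so the OS and ES caches warm up,
    # then average only the measured runs.
    for _ in range(warmup):
        run_query()
    times = []
    for _ in range(measured):
        start = time.time()
        run_query()
        times.append(time.time() - start)
    return sum(times) / len(times)
```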

Another thing to check is the distribution of load. Do you get different
results if you add a non-data node as a load balancer and run the same
queries against it?
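
For reference, a non-data node is just a regular node that holds no shards.
A minimal elasticsearch.yml sketch (the cluster name is a placeholder):

```
# Minimal sketch of a dedicated client / load-balancer node:
# it joins the cluster but stores no shards and is never elected master,
# so it only routes searches and merges per-shard results.
cluster.name: my-cluster   # placeholder; must match your cluster's name
node.master: false
node.data: false
```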

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene


--

Hi Zmicier,

Bizarre indeed. Have a look at some other metrics, both ES and system ones,
like disk I/O, memory, CPU, etc. How do they change as you add more
servers? Did you check whether shards were evenly spread across the cluster
and that no shards were being moved around while you ran your tests? If you
don't have a tool that can show you that, an Elasticsearch monitoring tool
will help.
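
For a quick check without extra tooling, something like this sketch
(Python; host and port are placeholders) shows whether any shards are
relocating and how many started shards each node holds:

```python
import collections

import requests  # assumes the cluster API is reachable on localhost:9200

BASE = "http://localhost:9200"

# Cluster health shows immediately if shards are moving or initializing.
health = requests.get(BASE + "/_cluster/health").json()
print("status=%s relocating=%d initializing=%d"
      % (health["status"], health["relocating_shards"],
         health["initializing_shards"]))

# Count STARTED shards per node from the routing table to see the spread.
state = requests.get(BASE + "/_cluster/state").json()
per_node = collections.Counter()
for index in state["routing_table"]["indices"].values():
    for copies in index["shards"].values():
        for shard in copies:
            if shard["state"] == "STARTED":
                per_node[shard["node"]] += 1
print(dict(per_node))
```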

Otis

http://sematext.com/ -- Search Analytics -- Performance Monitoring
