Clustering/Sharding impact on query performance

Hi,

SCENARIO

Our Elasticsearch database has ~2.5 million entries. Each entry has the
three analyzed fields "match", "sec_match" and "thi_match" (each containing
3-20 words) that are used in this query:
https://gist.github.com/anonymous/a8d1142512e5625e4e91

ES runs on two types of servers:
(1) Real servers (the system has direct access to physical CPUs, no
virtualization) of the newest generation - very performant!
(2) Cloud servers with virtualized CPUs - weak CPUs, but that is typical
for cloud services.

See https://gist.github.com/anonymous/3098b142c2bab51feecc for (1) and (2)
CPU details.

ES settings:
ES version 1.2.0 (jdk1.8.0_05)
ES_HEAP_SIZE = 512m (we also tested with 1024m with same results)
vm.max_map_count = 262144
ulimit -n 64000
ulimit -l unlimited
index.number_of_shards: 1
index.number_of_replicas: 0
index.store.type: mmapfs
threadpool.search.type: fixed
threadpool.search.size: 75
threadpool.search.queue_size: 5000

Infrastructure:
As you can see above, we don't use the clustering features of ES (1 shard,
0 replicas). The reason is that our hosting infrastructure is spread across
different providers.
Upside: we aren't dependent on a single hosting provider. Downside: our
servers aren't in the same LAN.

This means:

  • We cannot use ES sharding, because synchronisation via WAN (internet)
    does not seem practical.
  • So every ES server has the complete dataset, and we configured only one
    shard and no replicas for higher performance.
  • We have a distribution process that updates the ES data on every host
    frequently (a sketch of one possible implementation follows this list).
    This process is fine for us, because updates aren't very frequent and
    perfect just-in-time ES synchronisation isn't necessary for our business
    case.
  • If a server goes down/crashes, the central load balancer removes it (the
    resulting minimal packet loss is acceptable).
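
As a sketch, such a distribution step could be implemented with the ES
snapshot/restore API (repository name, snapshot name and path below are
placeholders, not our actual setup):

# register a filesystem repository (on the build host and on each target)
curl -XPUT 'localhost:9200/_snapshot/dist' -d '{
  "type": "fs",
  "settings": { "location": "/mnt/es_snapshots" }
}'

# on the build host: snapshot the freshly built index
curl -XPUT 'localhost:9200/_snapshot/dist/snap_1?wait_for_completion=true'

# copy the repository directory to each host (e.g. rsync over the WAN),
# close or delete the old index there, then restore:
curl -XPOST 'localhost:9200/_snapshot/dist/snap_1/_restore'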

PROBLEM

For long queries (6 or more keywords), we see very high CPU load, even on
the high performance server (1), which leads to high response times:
1-4 sec on server (1), 8-20 sec on server (2). The system parameters while
querying:

  • Very high load (usually 100%) on the CPU running the search thread (the
    other CPUs are idle in our test scenario)
  • No I/O load (the hard disks are fine)
  • No RAM bottlenecks

So we think the file caching is working fine, because we have no I/O
problems, and the garbage collector seems to be happy (jstat shows very few
GCs). The CPU is the problem, and the ES hot threads output points to the
Scorer module (see the "Elasticsearch Hot-Threads" gist).
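
For reference, we captured the hot threads output with the nodes hot
threads API, roughly like this (the parameters are just an example):

curl -XGET 'localhost:9200/_nodes/hot_threads?threads=3&interval=500ms'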

SUMMARY/ASSUMPTIONS

  • Our database isn't very big and the query isn't very complex.
  • ES is designed for huge amounts of data, but the key is
    clustering/sharding: distributing the data across many servers means
    smaller indices, and smaller indices mean lower CPU load and shorter
    response times.
  • So our database isn't big, but it is too big for a single CPU, which
    means that low-performance (virtual) CPUs in particular can only be
    used in sharded environments.

If we don't want to lose provider independence, we have only the following
two options:

  1. Simpler query (I think not possible in our case)
  2. Smaller database

QUESTIONS

Are our assumptions correct? Especially:

  • Is clustering/sharding (i.e. small indices) the main key to performance,
    that is, the only way to prevent overloaded (virtual) CPUs?
  • Is it right that clustering is only useful/possible in LANs?
  • Do you have any ES configuration or architecture hints regarding our
    preference for using multiple hosting providers?

Thank you. Rgds
Fin


Any hints?

On Monday, July 7, 2014 3:51:19 PM UTC+2, Fin Sekun wrote:


I would test using multiple primary shards on a single machine. Since your
dataset seems to fit into RAM, this could help for these longer latency
queries.
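
A sketch of such a test in ES 1.x (index name and shard count are just
examples): create an index with several primary shards, reindex the data,
and compare query latency against the single-shard index.

curl -XPUT 'localhost:9200/entries_test' -d '{
  "settings": {
    "number_of_shards": 4,
    "number_of_replicas": 0
  }
}'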

On Thursday, July 10, 2014 12:24:26 AM UTC-7, Fin Sekun wrote:


Hi Kireet, thanks for your answer and sorry for the late response. More
shards don't help. They slow the system down, because each shard carries
quite some overhead to maintain a Lucene index, and the smaller the shards,
the bigger the relative overhead. Having more shards improves indexing
performance and allows a big index to be distributed across machines, but I
don't have a cluster with a lot of machines. I observed these negative
effects while testing with 20 shards.

It would be very cool if somebody could answer/comment on the questions
summarized at the end of my post. Thanks again.

On Friday, July 11, 2014 3:02:50 AM UTC+2, Kireet Reddy wrote:


Your setup looks quite uncommon, but I won't comment on that any further,
as you seem to know what you are doing. Still, the idea of creating
single-shard clusters and distributing them across data centers runs
counter to the advantages of an ES installation.

I recommend revising your settings; they look very odd. Why did you change
the defaults? The heap size of 512m is definitely preventing you from using
more shards.
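
For comparison, a more conventional ES 1.x baseline could look like this
(assuming a host with 8 GB RAM; the values are illustrative, not a
recommendation for your exact hardware):

ES_HEAP_SIZE=4g              # roughly half of RAM, the rest goes to the OS file cache
index.number_of_shards: 2    # lets a single query use more than one core
index.number_of_replicas: 0
# threadpool.search.* left at the defaults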

Short answers to your questions

  • no, clustering/sharding is not the main key to performance; it is
    crucial for scalability

  • you can cluster not only in LANs, but over all kinds of hosts, as long
    as they can respond fast "enough". What is important for ES nodes is low
    network latency: high latency between nodes induces instabilities into
    ES, which makes life hard as an operator.

  • if you mean the client perspective, you should look at the tribe node,
    which can span multiple clusters (see the "Tribe node" page in the ES
    reference guide; a config sketch follows below)
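
A minimal sketch of a tribe node's elasticsearch.yml, assuming two clusters
with hypothetical names:

tribe.dc1.cluster.name: cluster_dc1
tribe.dc2.cluster.name: cluster_dc2
tribe.blocks.write: true     # optional: reject writes through the tribe node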

Jörg

On Mon, Jul 7, 2014 at 3:51 PM, 'Fin Sekun' via elasticsearch <
elasticsearch@googlegroups.com> wrote:


Your query looks weird: it says filtered but has no filter, and it specifies a boost but no sort. I would remove the filtered wrapper, either remove the boost or force the sort on _score, and make sure the fields use index_options of offsets (which also stores term frequencies, document counts and pretty much everything precomputed for the TF/IDF scoring). A sketch of that rewrite is below.
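
To illustrate (a hypothetical sketch: the field names come from the first
post, but the index and type names and the query text are assumptions,
since the actual query is only in the gist), the mapping and the un-wrapped
query could look like this in ES 1.x:

# mapping: index_options offsets (implies docs, freqs and positions)
curl -XPUT 'localhost:9200/entries' -d '{
  "mappings": {
    "entry": {
      "properties": {
        "match":     { "type": "string", "index_options": "offsets" },
        "sec_match": { "type": "string", "index_options": "offsets" },
        "thi_match": { "type": "string", "index_options": "offsets" }
      }
    }
  }
}'

# query: plain bool/match without the filtered wrapper; sorting by _score
# is the default, so no explicit sort is needed
curl -XGET 'localhost:9200/entries/_search' -d '{
  "query": {
    "bool": {
      "should": [
        { "match": { "match":     { "query": "six or more keywords here", "boost": 3 } } },
        { "match": { "sec_match": { "query": "six or more keywords here", "boost": 2 } } },
        { "match": { "thi_match": "six or more keywords here" } }
      ]
    }
  }
}'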


My working assumption had been that Elasticsearch executes queries across all shards in parallel and then merges the results, so maybe shards <= CPU cores would help in this case, where there is only one concurrent query. But I have never tested this assumption. Out of curiosity, during the 20-shard test did you still see only one CPU being used? Did you try 2 shards and get the same results?

On Jul 20, 2014, at 1:01 AM, 'Fin Sekun' via elasticsearch elasticsearch@googlegroups.com wrote:


Thanks for your answers.

@Jörg:
Of course sharding is primarily crucial for scalability, but it seems (and
nobody in this thread has disagreed) that big indices lead to high CPU
load, so sharding is crucial for performance too.

@smonasco:

  • The query is put together by an application that always sets a filter
    field. I would be surprised if ES didn't ignore this part efficiently.
  • There is no sort field, so ES sorts by score.
  • Offsets etc. are at their defaults for all analyzed fields.

@Kireet:
Hm, I think only one CPU was being used, but I'm not sure anymore; the test
was a long time ago. As soon as I have time, I will test with the number of
shards smaller than the number of CPUs.

On Monday, July 21, 2014 5:46:38 PM UTC+2, Kireet Reddy wrote:


I don't think big shards lead to high CPU load at query time.

Burst tries, the tree structure behind Lucene's inverted indexes, are set
up so that you only have to do at most two string compares (and it would
seem usually just one) to traverse them.

So your CPU load should depend on the number of (term, field) pairs you're
matching multiplied by the number of segments (plus perhaps some factor for
how fully each cycle can be utilized, which might depend on the depth of
the tree, which in turn keeps common terms closer to the top to prevent too
many reads and caps the depth for uncommon terms at the term length, and on
what other threads are currently running).

Given the way tree traversals (and burst tries in particular) work, I would
imagine fewer trees means fewer cycles. With one more read and no more full
string comparisons you can get exponentially farther.
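
As a rough worked example of that cost model (the numbers are illustrative):
a 6-keyword query against the 3 analyzed fields touches 6 x 3 = 18
(term, field) pairs; with, say, 30 segments that is about 540 dictionary
lookups, each bounded by two string compares, so the term lookup itself
stays cheap even for long queries.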
On Jul 27, 2014 5:40 AM, "'Fin Sekun' via elasticsearch" <
elasticsearch@googlegroups.com> wrote:


I think you confuse things and I disagree.

If you see high CPU load, it is because your system is running tight on
resources and is trying harder to cope with that automatically, e.g.
through more GC. It is not ES generating higher load just because of the
index document count; there is no relation between a system tight on
resources and the number of documents in an ES index.

If you add nodes, the shards are distributed over more nodes; that is
scalability. When you scale out your system, you can process a higher
number of queries concurrently. So with a constant index document count,
you can see all kinds of CPU load, low and high.

Jörg

On Sun, Jul 27, 2014 at 1:40 PM, 'Fin Sekun' via elasticsearch <
elasticsearch@googlegroups.com> wrote:
