I plan to use Elasticsearch as a documentation retrieval engine that will
serve hundreds of millions of documents, but the query rate will be low.
The ES cluster will probably receive only a few queries each hour.
We are planning to use EC2 m2.2xlarge instances, each with 32G memory and 4
CPU cores, so I would like to run 4 ES nodes on each EC2 instance to maximize
CPU utilization. In this case, is it beneficial to run multiple nodes
on the same machine?
My own experience with Solr is that it does help to use resources more
efficiently.
No, it is not beneficial.
Here are the reasons:
a) If you start many JVMs, you create JVM-induced overhead. That is, the
JVMs compete for the resources the OS provides (CPU, network, memory).
Because the OS must decide which JVM gets which resources, it spends
more time and space making those decisions, and this is not negligible. The
more JVMs you run in parallel, the higher the risk of overall system
degradation, and in many cases the risk of paging (swapping) is higher too.
b) The ES code is optimized for scalability. What does that mean? You
can increase the parameters for CPU (threads), memory (heap), and network
(netty pools) for the ES JVM, and this increases its overall capacity as
far as your machine can handle. There is no reason why you
should not dedicate a whole machine to a single ES node.
c) A single ES JVM can manage hundreds or thousands of Lucene indexes at
once. This is done by index sharding and automatic workload
distribution. Each node can hold many indices with many index shards. An
ES node does not restrict you to a model of a single index with a single
shard.
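To make point c) concrete, here is a minimal Python sketch of the idea. It is a toy round-robin allocator, not Elasticsearch's actual allocation logic, and the names `allocate_shards`, `docs-2012`, `docs-2013`, `node-1`, and `node-2` are made up for illustration. It only shows that one node can hold every shard of several indices, and that the same shards can later be spread over more nodes.

```python
from collections import defaultdict
from itertools import cycle


def allocate_shards(indices, nodes):
    """Toy round-robin placement of index shards onto nodes.

    indices: dict mapping index name -> number of shards (hypothetical values)
    nodes:   list of node names (hypothetical)
    Returns a dict mapping node name -> list of (index name, shard number).
    This only illustrates the concept; it is NOT how Elasticsearch
    actually decides shard placement.
    """
    if not nodes:
        raise ValueError("at least one node is required")
    placement = defaultdict(list)
    node_cycle = cycle(nodes)
    for index_name, shard_count in sorted(indices.items()):
        for shard in range(shard_count):
            placement[next(node_cycle)].append((index_name, shard))
    return dict(placement)


indices = {"docs-2012": 5, "docs-2013": 5}

# One node on the machine: it simply holds all ten shards.
single = allocate_shards(indices, ["node-1"])
print(len(single["node-1"]))  # prints 10

# Later, with a second machine, the same shards are spread over both nodes.
spread = allocate_shards(indices, ["node-1", "node-2"])
print({node: len(shards) for node, shards in spread.items()})
```

In a real cluster you would not write this yourself; the cluster places and relocates shards on its own when nodes join or leave.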
Thanks for the info, very useful. So basically I can run one ES instance
that holds multiple shards, and once the shards get big, I can migrate
them to separate machines?
Thanks,
Ming
On Tuesday, March 19, 2013 5:18:02 PM UTC-7, Jörg Prante wrote:
No, it is not beneficial.
Here are the reasons:
[points a) to c) as given above]
Totally agree for 32G machines, but as memory gets cheaper and cheaper I'm
curious whether anyone has actually done any benchmarking or stress tests of
single-node vs multi-node setups on large-memory machines.
We actually run 2 nodes (22G each) on our 64G machines:
a) so we can have -XX:+UseCompressedOops
b) on the theory (untested) that GC pauses will be faster/less often/...
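The arithmetic behind a) can be sketched in a few lines of Python. This assumes the commonly cited HotSpot behaviour that compressed ordinary object pointers are only usable when the heap is below roughly 32 GB; the exact cutoff depends on the JVM version and its flags, so the constant below is an approximation, and `plan_heaps` is a made-up helper, not part of any tool.

```python
# Approximate heap size above which a 64-bit HotSpot JVM can no longer use
# compressed oops. The real cutoff varies by JVM version and flags, so treat
# this constant as an assumption.
COMPRESSED_OOPS_LIMIT_GB = 32


def plan_heaps(machine_ram_gb, node_count, heap_per_node_gb):
    """Summarize a heap layout for `node_count` JVMs on one machine."""
    total_heap = node_count * heap_per_node_gb
    if total_heap > machine_ram_gb:
        raise ValueError("total heap exceeds physical RAM")
    return {
        "total_heap_gb": total_heap,
        # RAM not given to any heap stays available to the OS (file system cache).
        "left_for_os_gb": machine_ram_gb - total_heap,
        "compressed_oops_possible": heap_per_node_gb < COMPRESSED_OOPS_LIMIT_GB,
    }


# The layout described in this post: two 22G nodes on a 64G machine.
two_nodes = plan_heaps(64, 2, 22)
print(two_nodes)

# The same 44G of heap given to a single node instead.
one_node = plan_heaps(64, 1, 44)
print(one_node)
```

Both layouts leave 20G for the OS; under the stated assumption, only the two-node layout keeps each heap small enough for compressed oops. Whether that outweighs the overhead of a second JVM is exactly the kind of thing a benchmark would have to show.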
I have heard that multiple cores are utilized best by separate processes
rather than by separate threads.
If that is true and the machine has many cores, wouldn't multiple instances
be a good idea?