Best cluster environment for search


(Marcelo Paes Rech) #1

Hi guys,

I'm looking for an article or a guide to the best cluster configuration. I
have read a lot of articles that say "change this configuration" or "you
must create X shards per node", but I haven't seen anything like an
official Elasticsearch guide for creating a cluster.

What I would like to know is information such as:

  • How to calculate how many shards would be good for the cluster.
  • How many shards do we need per node? And if this varies, how do I
    calculate it?
  • How much memory do I need per node, and how many nodes?

I think Elasticsearch is well documented, but the documentation is very fragmented.

Best regards.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/b8894495-e64b-4796-9eb4-e49e1b9ce556%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Nik Everett) #2


For some of these, that is because "it depends" is the answer. For example,
you'll want larger heaps for aggregations and faceting.

There are some rules of thumb:

  1. Set Elasticsearch's heap to 1/2 of RAM, but not more than 30GB.
    Bigger than that and the JVM can't do pointer compression, and you
    effectively lose RAM.
  2. #1 implies that having much more than 60GB of RAM on each node doesn't
    make a big difference. It helps, but it's not really as good as having
    more nodes.
  3. The most efficient way of sharding is likely one shard on each node.
    So if you have 9 nodes and a replication factor of 2 (so 3 total
    copies), then 3 shards are likely to be more efficient than 2 or 4.
  4. But this only really matters when those shards get lots of traffic.
    And it breaks down a bit when you get lots of nodes, and in the
    presence of routing. It's complicated.

But these are really just starting points, safe-ish defaults.
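These rules of thumb reduce to simple arithmetic. A back-of-the-envelope sketch, with hypothetical input numbers (the 30GB heap ceiling and the one-shard-copy-per-node heuristic come from the rules above; the function names are mine):

```python
# Rough sizing sketch based on the rules of thumb above (hypothetical inputs).

def heap_gb(ram_gb):
    # Rule 1: give Elasticsearch half of RAM, capped at 30GB so the JVM
    # keeps compressed object pointers.
    return min(ram_gb // 2, 30)

def primary_shards(nodes, replicas):
    # Rule 3: aim for one shard copy per node, i.e.
    # primaries * (replicas + 1) == nodes when it divides evenly.
    copies = replicas + 1
    return max(nodes // copies, 1)

print(heap_gb(16))            # 8  -> 8GB heap on a 16GB node
print(heap_gb(128))           # 30 -> capped, even though half of RAM is 64
print(primary_shards(9, 2))   # 3  -> 9 nodes, 3 total copies of each shard
```

Per rule 4, treat the results as starting points, not targets.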

Nik



(Marcelo Paes Rech) #3

Thanks for your reply, Nikolas. It helps a lot.

What about the quantity of documents in each shard, or the size of each
shard? And when are no-data nodes or master-only nodes necessary?

In some tests I ran, when I increased the number of requests (e.g. 100
users at the same moment, repeated again and again), 5 nodes with 1 shard
and 2 replicas each and 16GB of RAM (8GB for ES and 8GB for the OS) weren't
enough. The response time started to exceed 5s (I think less than 1s would
be acceptable in this case).

This test had a lot of documents (something like 14 million).

Thanks. Regards.



(Jörg Prante) #4

Can you show your test code?

You seem to be looking at the wrong settings - adjusting node count, shard
count, and replica count alone will not tell you the maximum node
performance. Concurrency settings, index optimizations, query
optimizations, thread pooling, and, most of all, fast disk subsystem I/O
are what matter.

Jörg



(Marcelo Paes Rech) #5

Hi Jörg. Thanks for your reply.

Here is my filter:

{
  "filter": {
    "terms": {
      "_id": [
        "QSxrbEM8TKe5zr8931xBjA", "wj63ghegRwC6qLsWq2chkA",
        "hYEhDbAqQwSRxhYfvDgFkg", "4bZmPE1fTYqijphRyyWiuQ",
        "Fhq53yYyT3CEw6vclKu_NA", "XL2atBraTEyx57MefjFVhA",
        "951i0dZkT064FlQkzHnnWA", "O8Ixbir1TrGT_IA3wKfsHg",
        "8k4U7KsuTmsThqxy-5YaKw", "GNOoQTHglf22kzcE7EOf8g",
        "-RQeY48fTg2kYnh2M4E1cQ", "u8DGBdfVR9WRVj6d9E4Ebw",
        "WFHSXd7UQvCMYFBhFcTsng", "qnQ7q7FyTsg397lM1EWgqA",
        "wRQtUzdMRy2qOkMCNxdpgA", "Ll83iglxSUS_Gs7mjkMt8w",
        "d2sxZ1oBTfuvAfov5EJ0iw", "cyht-vB4Q-mMSg9N5jcGXg",
        "bNSVaO47QTOCkfJhWo0qjg", "BHuhm55IRerKnynJ8WgFTw",
        "fHKA4PF2QteWm8E7dW7CAw", "DLE6A7tyQJ-zcKcCa6IPSA",
        "qfelTW7-SuGRQ0GKbngARA", "R7VHHJhYsUqfuxYof8BJ8w",
        "W4PqiJfPSlSFjVKFsGkA4Q", "Juq62zOsRdheuW3O6Gb2KA",
        "U9v0IKj_RrgRNjE31ZTt2g", "uNHa0kOOT5qjPpzxZcs35A",
        "SwOgVNgIRwyVU3pEEycBuQ", "LaEpxFGIQgCArsNZ2rd4Pw",
        "CiJ9gouZsbmTtxTWx7w6lA", "TaQV_I01RfCq3B6uAtIBoQ",
        "9Jpjo5k-RlGfLVLF6nDgze", "57YpjRdASsrrae-RD3spog",
        "bmA4EWFSTiKUaDzaNcCFKQ", "Fui9z_UbRe6AY1VhAr8Crw",
        "2PORr5BzSDOmBXgmQkO5Zg", "snfwTmtuTv-uj5mOWSJpgA",
        "0nHIrtePSaeW8aWArh_Mrg", "s0g9QHnjTgWX3rCIu1g0Hg",
        "Jl67fACuQvCFgZxXAFtDOg"
      ],
      "_cache": true,
      "_cache_key": "my_terms_cache"
    }
  }
}

I have already tried the ids filter, but I got the same behaviour. One
thing I noticed is that one of the cluster's nodes keeps growing its search
thread pool (something like Queue: 50 and Count: 47) while the others don't
(something like Queue: 0 and Count: 1). If I remove that node from the
cluster, another one develops the same problem.

My current environment is:

  • 7 data nodes with 16GB (8GB for ES) and 8 cores each;
  • 4 load-balancer nodes (no data, no master) with 4GB (3GB for ES) and 8
    cores each;
  • 4 master nodes (master only, no data) with 4GB (3GB for ES) and 8 cores
    each;
  • search thread pool size 47 (the other pools are at their defaults);
  • 7 shards and 2 replicas per index;
  • 14.6GB index size (14,524,273 documents).

I'm executing this filter with 50 concurrent users.

Regards



(Jörg Prante) #6

Why do you use a terms filter on the _id field and not the ids filter? The
ids filter is more efficient, since it reuses the _uid field, which is
cached by default.
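For reference, a minimal sketch of the ids filter form (the value list is abbreviated to two of the IDs from the filter above; wrap it in your search request body as with the terms filter):

```json
{
  "filter": {
    "ids": {
      "values": [ "QSxrbEM8TKe5zr8931xBjA", "wj63ghegRwC6qLsWq2chkA" ]
    }
  }
}
```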

Do the terms in the query vary from query to query? If so, caching might
kill your heap.

Another possible issue is that your query is not distributed across all
shards. If the query does not vary from user to user in your test, you have
created a "hot spot": all the load from the 100 users would go to a limited
number of nodes holding a limited number of shards.

The search thread pool seems small at 50 threads if you execute searches
for 100 users in parallel; this can lead to congestion of the search
module. Why don't you use at least 100?
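If you want to experiment with a larger pool, it can be set in elasticsearch.yml; a sketch with illustrative numbers only, not a recommendation:

```yaml
# elasticsearch.yml -- illustrative values only
threadpool.search.size: 100        # worker threads for the search module
threadpool.search.queue_size: 200  # requests queued before rejection
```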

Jörg



(Marcelo Paes Rech) #7

Hi Jörg. Thanks for your reply again.

As I said, I had already tried the ids filter, but I got the same behaviour.

I realized what was wrong. Maybe it is a bug in ES, maybe not. When I
executed the filter I included "from" and "size" attributes. In this case
"size" was 999999, but the final result would be just 10 documents.
Apparently ES pre-allocates the objects I tell it I will use (probably for
performance reasons), but if the final result is smaller than the requested
size (999999), ES doesn't release the remaining pre-allocated objects until
the memory (heap) is full.

I changed the size attribute to 10 and the heap became stable.
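In other words, the fix is to request only as many hits as you will actually read. A sketch of the corrected request body (the filter is abbreviated to one ID for illustration):

```json
{
  "from": 0,
  "size": 10,
  "query": {
    "filtered": {
      "filter": {
        "ids": { "values": [ "QSxrbEM8TKe5zr8931xBjA" ] }
      }
    }
  }
}
```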

That's it. Thanks.

Regards.



(Jörg Prante) #8

Ah, that is a simple resolution, thanks for highlighting it.

Jörg

On Thu, Jun 5, 2014 at 2:38 PM, Marcelo Paes Rech <marcelopaesrech@gmail.com

wrote:

Hi Jörg. Thanks for your reply again.

As I said, I already had used ids filter, but I got the same behaviour.

I realized what was wrong. Maybe it could be a bug in ES or not. When I
executed the filter I included "from" and "size" attibutes. In this case
"size" was 999999, but the final result would be just 10 documents.
Aparently ES pre-allocates the objects that I say I will use (maybe for
performance reasons), but if the final result is not the total (999999), ES
doesn't remove remaining pre-allocated objects until the memory (heap) is
full.

I changed the size attribute to 10 and heap became stable.

That's it. Thanks.

Regards.

Em quarta-feira, 4 de junho de 2014 19h54min15s UTC-3, Jörg Prante
escreveu:

Why do you use a terms filter on the _id field and not the ids filter? The
ids filter is more efficient since it reuses the _uid field, which is
cached by default.

Do the terms in the query vary from query to query? If so, caching might
kill your heap.

Another possible issue is that your query is not distributed to all
shards, if the query does not vary from user to user in your test. If so,
you have created a "hot spot": all the load from the 100 users would go to
a limited number of nodes with a limited shard count.

The search thread pool seems small at 50 if you execute searches for 100
users in parallel; this can lead to congestion of the search module. Why
don't you use 100 (at least)?

Jörg
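For reference, the 1.x ids filter Jörg suggests takes a plain values array. A minimal sketch of the request body, built here as a Python dict (the two ids are placeholders taken from the thread):

```python
import json

# Minimal sketch of a 1.x ids filter body, the equivalent intent of the
# terms filter on _id shown below in the thread. The id values are
# placeholders; optionally add "type": "my_type" to pin a mapping type.
doc_ids = ["QSxrbEM8TKe5zr8931xBjA", "wj63ghegRwC6qLsWq2chkA"]

ids_filter = {
    "filter": {
        "ids": {
            "values": doc_ids
        }
    }
}

print(json.dumps(ids_filter, indent=2))
```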

On Wed, Jun 4, 2014 at 2:40 PM, Marcelo Paes Rech <marcelo...@gmail.com> wrote:

Hi Jörg. Thanks for your reply.

Here is my filter:

{
  "filter" : {
    "terms" : {
      "_id" : [
        "QSxrbEM8TKe5zr8931xBjA", "wj63ghegRwC6qLsWq2chkA",
        "hYEhDbAqQwSRxhYfvDgFkg", "4bZmPE1fTYqijphRyyWiuQ",
        "Fhq53yYyT3CEw6vclKu_NA", "XL2atBraTEyx57MefjFVhA",
        "951i0dZkT064FlQkzHnnWA", "O8Ixbir1TrGT_IA3wKfsHg",
        "8k4U7KsuTmsThqxy-5YaKw", "GNOoQTHglf22kzcE7EOf8g",
        "-RQeY48fTg2kYnh2M4E1cQ", "u8DGBdfVR9WRVj6d9E4Ebw",
        "WFHSXd7UQvCMYFBhFcTsng", "qnQ7q7FyTsg397lM1EWgqA",
        "wRQtUzdMRy2qOkMCNxdpgA", "Ll83iglxSUS_Gs7mjkMt8w",
        "d2sxZ1oBTfuvAfov5EJ0iw", "cyht-vB4Q-mMSg9N5jcGXg",
        "bNSVaO47QTOCkfJhWo0qjg", "BHuhm55IRerKnynJ8WgFTw",
        "fHKA4PF2QteWm8E7dW7CAw", "DLE6A7tyQJ-zcKcCa6IPSA",
        "qfelTW7-SuGRQ0GKbngARA", "R7VHHJhYsUqfuxYof8BJ8w",
        "W4PqiJfPSlSFjVKFsGkA4Q", "Juq62zOsRdheuW3O6Gb2KA",
        "U9v0IKj_RrgRNjE31ZTt2g", "uNHa0kOOT5qjPpzxZcs35A",
        "SwOgVNgIRwyVU3pEEycBuQ", "LaEpxFGIQgCArsNZ2rd4Pw",
        "CiJ9gouZsbmTtxTWx7w6lA", "TaQV_I01RfCq3B6uAtIBoQ",
        "9Jpjo5k-RlGfLVLF6nDgze", "57YpjRdASsrrae-RD3spog",
        "bmA4EWFSTiKUaDzaNcCFKQ", "Fui9z_UbRe6AY1VhAr8Crw",
        "2PORr5BzSDOmBXgmQkO5Zg", "snfwTmtuTv-uj5mOWSJpgA",
        "0nHIrtePSaeW8aWArh_Mrg", "s0g9QHnjTgWX3rCIu1g0Hg",
        "Jl67fACuQvCFgZxXAFtDOg"
      ],
      "_cache" : true,
      "_cache_key" : "my_terms_cache"
    }
  }
}

I already used the "ids filter" but got the same behaviour. One thing
I realized is that one of the cluster's nodes keeps filling its search
thread pool (something like Queue: 50 and Count: 47) while the others don't
(something like Queue: 0 and Count: 1). If I remove this node from the
cluster, another one starts showing the same problem.

My current environment is:

  • 7 data nodes with 16 GB RAM (8 GB for ES) and 8 cores each;
  • 4 load-balancer nodes (no data, no master) with 4 GB (3 GB for ES) and
    8 cores each;
  • 4 master-only nodes (no data) with 4 GB (3 GB for ES) and 8 cores each;
  • search thread pool size 47 (the other pools use the default config);
  • index with 7 shards and 2 replicas;
  • 14.6 GB index size (14,524,273 documents).

I'm executing this filter with 50 concurrent users.

Regards
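As a sanity check on that layout, the shard math is plain arithmetic (no ES specifics assumed beyond the numbers listed above):

```python
# Back-of-the-envelope numbers for the environment described above:
# 7 primaries with 2 replicas spread over 7 data nodes.
primaries, replicas, data_nodes = 7, 2, 7

total_copies = primaries * (1 + replicas)    # every shard exists 3 times
copies_per_node = total_copies / data_nodes  # in the balanced case
gb_per_primary = 14.6 / primaries            # 14.6 GB index over 7 primaries

print(total_copies, copies_per_node, round(gb_per_primary, 1))
```

So each data node hosts about 3 shard copies of roughly 2 GB each, which is well within what an 8 GB heap should handle for this index size.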

On Tuesday, 3 June 2014 at 20:33 (UTC-3), Jörg Prante wrote:

Can you show your test code?

You seem to be looking at the wrong settings - by adjusting node count,
shard count, and replica count alone, you cannot find the maximum node
performance. Concurrency settings, index optimizations, query
optimizations, thread pooling, and most of all fast disk subsystem I/O
are important.

Jörg

On Wed, Jun 4, 2014 at 12:18 AM, Marcelo Paes Rech <marcelo...@gmail.com> wrote:

Thanks for your reply, Nikolas. It helps a lot.

What about the number of documents in each shard, or the size of each
shard? And when are no-data nodes or master-only nodes necessary?

In some tests I did, when I increased the number of requests (like 100
users at the same moment, again and again), 5 nodes with 1 shard and 2
replicas each and 16 GB RAM (8 GB for ES and 8 GB for the OS) weren't
enough. The response time started to exceed 5 s (I think less than 1 s
would be acceptable in this case).

This test has a lot of documents (something like 14 million).

Thanks. Regards.

On Monday, 2 June 2014 at 17:09 (UTC-3), Nikolas Everett wrote:

On Mon, Jun 2, 2014 at 3:52 PM, Marcelo Paes Rech <marcelo...@gmail.com> wrote:

Hi guys,

I'm looking for an article or a guide for the best cluster
configuration. I have read a lot of articles like "change this
configuration" and "you must create X shards per node", but I haven't seen
anything like an official Elasticsearch guide for creating a cluster.

What I would like to know is information like:

  • How to calculate how many shards will be good for the cluster.
  • How many shards do we need per node? And if this is variable, how
    do I calculate it?
  • How much memory do I need per node, and how many nodes?

I think Elasticsearch is well documented, but the documentation is very
fragmented.

For some of these, that is because "it depends" is the answer. For
example, you'll want larger heaps for aggregations and faceting.

There are some rules of thumb:

  1. Set Elasticsearch's heap to 1/2 of RAM, but not more than 30GB. Bigger
    than that and the JVM can't do pointer compression, so you effectively
    lose RAM.
  2. #1 implies that having much more than 60GB of RAM on each node
    doesn't make a big difference. It helps, but it's not really as good as
    having more nodes.
  3. The most efficient way of sharding is likely one shard on each node.
    So if you have 9 nodes and a replication factor of 2 (so 3 total
    copies), then 3 shards are likely to be more efficient than 2 or 4.
  4. But this only really matters when those shards get lots of traffic,
    and it breaks down a bit when you get lots of nodes and in the presence
    of routing. It's complicated.

But these are really just starting points, safe-ish defaults.

Nik
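Nik's first two rules of thumb condense into a tiny helper (a sketch; the 30 GB cap is the approximate compressed-pointers threshold he refers to):

```python
# Heap sizing per the rules of thumb above: half of RAM, capped around
# 30 GB so the JVM keeps compressed object pointers.
def heap_gb(ram_gb: float, cap_gb: float = 30.0) -> float:
    return min(ram_gb / 2.0, cap_gb)

print(heap_gb(16))   # the 16 GB data nodes discussed in this thread
print(heap_gb(128))  # past ~60 GB of RAM the cap dominates
```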



(Mark Walkom) #9

This would probably be worth raising as a GitHub issue -
https://github.com/elasticsearch/

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com



(Marcelo Paes Rech) #10

I have just created an issue:

Regards.



(Clinton Gormley) #11

On Thursday, 5 June 2014 00:54:15 UTC+2, Jörg Prante wrote:

Why do you use terms on the _id field and not the ids filter? The ids
filter is more efficient since it reuses the _uid field, which is cached by
default.

So does the terms filter. The only advantage of the ids filter is that you
can specify a type for each id; otherwise it looks up type1#id, type2#id,
etc., which is exactly what the ids filter does when no type is specified.



(system) #12