Increasing shards and then nodes


(Lee Parker) #1

We have been using elastic search within our production environment for a
few months now. I have discovered that I didn't properly plan for the
amount of data in the index and find myself in need of increasing the number
of shards to improve performance. I am also going to increase the number of
nodes. I know that you can't just increase the number of shards without
reindexing the data. The difficulty in reindexing the data for me is that
our data currently resides in a highly normalized system and de-normalizing
each item into the document that ES indexes can take a long time (2.5 weeks
last time).

In that case, here is my plan for increasing my shard count:

  1. Spin up a new ES server but make sure it doesn't join the existing server
    as a node.
  2. Pull the list of documents from my data and then get the documents from
    the existing ES server and put them on the new one.
  3. Once all the data is in the new ES server, shutdown the older one wipe
    its data and then start it up as a new node to join the new ES server.

Is there a better way to keep things going and increase the shard count? We
are currently using the 5 shard default and all of the shards are around
12G. I am thinking about setting it to 15 shards and 1 (possibly 2)
replicas. Does this make sense? We are going to have a 3 node cluster by
the end of the month and will likely increase that as we see fit throughout
the year.

Lee

"It doesn't matter whether you are liberal or conservative, but it's
dangerous to always think with exclamation points instead of question
marks."
by Marty Beckerman


(Clinton Gormley) #2

Hi Lee

In that case, here is my plan for increasing my shard count:

  1. Spin up a new ES server but make sure it doesn't join the existing
    server as a node.
  2. Pull the list of documents from my data and then get the documents
    from the existing ES server and put them on the new one.
  3. Once all the data is in the new ES server, shutdown the older one
    wipe its data and then start it up as a new node to join the new ES
    server.

Alternatively, instead of messing with a new server, you could:

  • create a new index 'new_index_timestamp' on the same ES server
  • index from 'my_index' to 'new_index_timestamp'
  • delete 'my_index'
  • create alias 'my_index' pointing to 'new_index_timestamp'

The benefit of switching to using aliases is that it will make it easier
to make changes to your index in the future.

clint


(Shay Banon) #3

The 5 shards default allow you to grow up to 5 nodes in terms of index capacity, and in terms of improving search capacity, you can simply dynamically add more replicas to the cluster. So, with a 5 shard 1 replica option, you will max out on 10 nodes (nothing will be left to be allocated to the 11th node if started).

Sounds like you are planning on growing from 1-2 nodes to 3, this is still well below the capacity provided by 5 shards and 1 replica. A 12gb index is not a big one.

-shay.banon
On Wednesday, January 19, 2011 at 7:39 PM, Clinton Gormley wrote:

Hi Lee

In that case, here is my plan for increasing my shard count:

  1. Spin up a new ES server but make sure it doesn't join the existing
    server as a node.
  2. Pull the list of documents from my data and then get the documents
    from the existing ES server and put them on the new one.
  3. Once all the data is in the new ES server, shutdown the older one
    wipe its data and then start it up as a new node to join the new ES
    server.

Alternatively, instead of messing with a new server, you could:

  • create a new index 'new_index_timestamp' on the same ES server
  • index from 'my_index' to 'new_index_timestamp'
  • delete 'my_index'
  • create alias 'my_index' pointing to 'new_index_timestamp'

The benefit of switching to using aliases is that it will make it easier
to make changes to your index in the future.

clint


(Lee Parker) #4

It was my understanding based on some other threads that performance will
degrade once shards are greater than 10g. If that is not true then I will
just stick with the 5 shards and 1 replica setup.

Lee

"It doesn't matter whether you are liberal or conservative, but it's
dangerous to always think with exclamation points instead of question
marks."
by Marty Beckerman
On Wed, Jan 19, 2011 at 1:22 PM, Shay Banon shay.banon@elasticsearch.comwrote:

The 5 shards default allow you to grow up to 5 nodes in terms of index
capacity, and in terms of improving search capacity, you can simply
dynamically add more replicas to the cluster. So, with a 5 shard 1 replica
option, you will max out on 10 nodes (nothing will be left to be allocated
to the 11th node if started).

Sounds like you are planning on growing from 1-2 nodes to 3, this is still
well below the capacity provided by 5 shards and 1 replica. A 12gb index is
not a big one.

-shay.banon

On Wednesday, January 19, 2011 at 7:39 PM, Clinton Gormley wrote:

Hi Lee

In that case, here is my plan for increasing my shard count:

  1. Spin up a new ES server but make sure it doesn't join the existing
    server as a node.
  2. Pull the list of documents from my data and then get the documents
    from the existing ES server and put them on the new one.
  3. Once all the data is in the new ES server, shutdown the older one
    wipe its data and then start it up as a new node to join the new ES
    server.

Alternatively, instead of messing with a new server, you could:

  • create a new index 'new_index_timestamp' on the same ES server
  • index from 'my_index' to 'new_index_timestamp'
  • delete 'my_index'
  • create alias 'my_index' pointing to 'new_index_timestamp'

The benefit of switching to using aliases is that it will make it easier
to make changes to your index in the future.

clint


(Shay Banon) #5

It really depends what you plan to do with it. They can certainly be bigger than 10gb, if you use things like facets or sorting, the values of the field per doc will be loaded to memory, which means that you might need more memory as the shard doc count grows, but that is linear growth, that can be easily derived from some capacity planning down on a smaller size.

-shay.banon
On Thursday, January 20, 2011 at 4:01 PM, Lee Parker wrote:

It was my understanding based on some other threads that performance will degrade once shards are greater than 10g. If that is not true then I will just stick with the 5 shards and 1 replica setup.

Lee

"It doesn't matter whether you are liberal or conservative, but it's dangerous to always think with exclamation points instead of question marks."
by Marty Beckerman

On Wed, Jan 19, 2011 at 1:22 PM, Shay Banon shay.banon@elasticsearch.com wrote:

The 5 shards default allow you to grow up to 5 nodes in terms of index capacity, and in terms of improving search capacity, you can simply dynamically add more replicas to the cluster. So, with a 5 shard 1 replica option, you will max out on 10 nodes (nothing will be left to be allocated to the 11th node if started).

Sounds like you are planning on growing from 1-2 nodes to 3, this is still well below the capacity provided by 5 shards and 1 replica. A 12gb index is not a big one.

-shay.banon
On Wednesday, January 19, 2011 at 7:39 PM, Clinton Gormley wrote:

Hi Lee

In that case, here is my plan for increasing my shard count:

  1. Spin up a new ES server but make sure it doesn't join the existing
    server as a node.
  2. Pull the list of documents from my data and then get the documents
    from the existing ES server and put them on the new one.
  3. Once all the data is in the new ES server, shutdown the older one
    wipe its data and then start it up as a new node to join the new ES
    server.

Alternatively, instead of messing with a new server, you could:

  • create a new index 'new_index_timestamp' on the same ES server
  • index from 'my_index' to 'new_index_timestamp'
  • delete 'my_index'
  • create alias 'my_index' pointing to 'new_index_timestamp'

The benefit of switching to using aliases is that it will make it easier
to make changes to your index in the future.

clint


(Lee Parker) #6

We do regularly sort the results using a field we call date which contains a
unix timestamp as an integer. we are already experiencing slow results from
a filtered and sorted query. If we currently have about 60G of data, which
we do filtered and sorted queries against regularly, should we plan to use a
greater number of shards?

Lee

"It doesn't matter whether you are liberal or conservative, but it's
dangerous to always think with exclamation points instead of question
marks."
by Marty Beckerman
On Thu, Jan 20, 2011 at 10:18 AM, Shay Banon
shay.banon@elasticsearch.comwrote:

It really depends what you plan to do with it. They can certainly be
bigger than 10gb, if you use things like facets or sorting, the values of
the field per doc will be loaded to memory, which means that you might need
more memory as the shard doc count grows, but that is linear growth, that
can be easily derived from some capacity planning down on a smaller size.

-shay.banon

On Thursday, January 20, 2011 at 4:01 PM, Lee Parker wrote:

It was my understanding based on some other threads that performance will
degrade once shards are greater than 10g. If that is not true then I will
just stick with the 5 shards and 1 replica setup.

Lee

"It doesn't matter whether you are liberal or conservative, but it's
dangerous to always think with exclamation points instead of question
marks."
by Marty Beckerman
On Wed, Jan 19, 2011 at 1:22 PM, Shay Banon shay.banon@elasticsearch.comwrote:

The 5 shards default allow you to grow up to 5 nodes in terms of index
capacity, and in terms of improving search capacity, you can simply
dynamically add more replicas to the cluster. So, with a 5 shard 1 replica
option, you will max out on 10 nodes (nothing will be left to be allocated
to the 11th node if started).

Sounds like you are planning on growing from 1-2 nodes to 3, this is still
well below the capacity provided by 5 shards and 1 replica. A 12gb index is
not a big one.

-shay.banon

On Wednesday, January 19, 2011 at 7:39 PM, Clinton Gormley wrote:

Hi Lee

In that case, here is my plan for increasing my shard count:

  1. Spin up a new ES server but make sure it doesn't join the existing
    server as a node.
  2. Pull the list of documents from my data and then get the documents
    from the existing ES server and put them on the new one.
  3. Once all the data is in the new ES server, shutdown the older one
    wipe its data and then start it up as a new node to join the new ES
    server.

Alternatively, instead of messing with a new server, you could:

  • create a new index 'new_index_timestamp' on the same ES server
  • index from 'my_index' to 'new_index_timestamp'
  • delete 'my_index'
  • create alias 'my_index' pointing to 'new_index_timestamp'

The benefit of switching to using aliases is that it will make it easier
to make changes to your index in the future.

clint


(Shay Banon) #7

As long as you plan to add more capacity when it comes to number of servers/nodes.
On Thursday, January 20, 2011 at 6:36 PM, Lee Parker wrote:

We do regularly sort the results using a field we call date which contains a unix timestamp as an integer. we are already experiencing slow results from a filtered and sorted query. If we currently have about 60G of data, which we do filtered and sorted queries against regularly, should we plan to use a greater number of shards?

Lee

"It doesn't matter whether you are liberal or conservative, but it's dangerous to always think with exclamation points instead of question marks."
by Marty Beckerman

On Thu, Jan 20, 2011 at 10:18 AM, Shay Banon shay.banon@elasticsearch.com wrote:

It really depends what you plan to do with it. They can certainly be bigger than 10gb, if you use things like facets or sorting, the values of the field per doc will be loaded to memory, which means that you might need more memory as the shard doc count grows, but that is linear growth, that can be easily derived from some capacity planning down on a smaller size.

-shay.banon
On Thursday, January 20, 2011 at 4:01 PM, Lee Parker wrote:

It was my understanding based on some other threads that performance will degrade once shards are greater than 10g. If that is not true then I will just stick with the 5 shards and 1 replica setup.

Lee

"It doesn't matter whether you are liberal or conservative, but it's dangerous to always think with exclamation points instead of question marks."
by Marty Beckerman

On Wed, Jan 19, 2011 at 1:22 PM, Shay Banon shay.banon@elasticsearch.com wrote:

The 5 shards default allow you to grow up to 5 nodes in terms of index capacity, and in terms of improving search capacity, you can simply dynamically add more replicas to the cluster. So, with a 5 shard 1 replica option, you will max out on 10 nodes (nothing will be left to be allocated to the 11th node if started).

Sounds like you are planning on growing from 1-2 nodes to 3, this is still well below the capacity provided by 5 shards and 1 replica. A 12gb index is not a big one.

-shay.banon
On Wednesday, January 19, 2011 at 7:39 PM, Clinton Gormley wrote:

Hi Lee

In that case, here is my plan for increasing my shard count:

  1. Spin up a new ES server but make sure it doesn't join the existing
    server as a node.
  2. Pull the list of documents from my data and then get the documents
    from the existing ES server and put them on the new one.
  3. Once all the data is in the new ES server, shutdown the older one
    wipe its data and then start it up as a new node to join the new ES
    server.

Alternatively, instead of messing with a new server, you could:

  • create a new index 'new_index_timestamp' on the same ES server
  • index from 'my_index' to 'new_index_timestamp'
  • delete 'my_index'
  • create alias 'my_index' pointing to 'new_index_timestamp'

The benefit of switching to using aliases is that it will make it easier
to make changes to your index in the future.

clint


(Shay Banon) #8

And make sure you have enough memory allocated to the nodes as well. You can check how much memory is being occupied by the field data cache by using the node stats api.
On Thursday, January 20, 2011 at 6:38 PM, Shay Banon wrote:

As long as you plan to add more capacity when it comes to number of servers/nodes.
On Thursday, January 20, 2011 at 6:36 PM, Lee Parker wrote:

We do regularly sort the results using a field we call date which contains a unix timestamp as an integer. we are already experiencing slow results from a filtered and sorted query. If we currently have about 60G of data, which we do filtered and sorted queries against regularly, should we plan to use a greater number of shards?

Lee

"It doesn't matter whether you are liberal or conservative, but it's dangerous to always think with exclamation points instead of question marks."
by Marty Beckerman

On Thu, Jan 20, 2011 at 10:18 AM, Shay Banon shay.banon@elasticsearch.com wrote:

It really depends what you plan to do with it. They can certainly be bigger than 10gb, if you use things like facets or sorting, the values of the field per doc will be loaded to memory, which means that you might need more memory as the shard doc count grows, but that is linear growth, that can be easily derived from some capacity planning down on a smaller size.

-shay.banon
On Thursday, January 20, 2011 at 4:01 PM, Lee Parker wrote:

It was my understanding based on some other threads that performance will degrade once shards are greater than 10g. If that is not true then I will just stick with the 5 shards and 1 replica setup.

Lee

"It doesn't matter whether you are liberal or conservative, but it's dangerous to always think with exclamation points instead of question marks."
by Marty Beckerman

On Wed, Jan 19, 2011 at 1:22 PM, Shay Banon shay.banon@elasticsearch.com wrote:

The 5 shards default allow you to grow up to 5 nodes in terms of index capacity, and in terms of improving search capacity, you can simply dynamically add more replicas to the cluster. So, with a 5 shard 1 replica option, you will max out on 10 nodes (nothing will be left to be allocated to the 11th node if started).

Sounds like you are planning on growing from 1-2 nodes to 3, this is still well below the capacity provided by 5 shards and 1 replica. A 12gb index is not a big one.

-shay.banon
On Wednesday, January 19, 2011 at 7:39 PM, Clinton Gormley wrote:

Hi Lee

In that case, here is my plan for increasing my shard count:

  1. Spin up a new ES server but make sure it doesn't join the existing
    server as a node.
  2. Pull the list of documents from my data and then get the documents
    from the existing ES server and put them on the new one.
  3. Once all the data is in the new ES server, shutdown the older one
    wipe its data and then start it up as a new node to join the new ES
    server.

Alternatively, instead of messing with a new server, you could:

  • create a new index 'new_index_timestamp' on the same ES server
  • index from 'my_index' to 'new_index_timestamp'
  • delete 'my_index'
  • create alias 'my_index' pointing to 'new_index_timestamp'

The benefit of switching to using aliases is that it will make it easier
to make changes to your index in the future.

clint


(Lee Parker) #9

The current node has 1.2g available to the heap and shows that it is using
1g of that. I will be building out to at least three nodes in the next few
weeks. Is 1.2g of heap enough when we have shards of this size or larger?

Lee

"It doesn't matter whether you are liberal or conservative, but it's
dangerous to always think with exclamation points instead of question
marks."
by Marty Beckerman
On Thu, Jan 20, 2011 at 10:39 AM, Shay Banon
shay.banon@elasticsearch.comwrote:

And make sure you have enough memory allocated to the nodes as well. You
can check how much memory is being occupied by the field data cache by using
the node stats api.

On Thursday, January 20, 2011 at 6:38 PM, Shay Banon wrote:

As long as you plan to add more capacity when it comes to number of
servers/nodes.

On Thursday, January 20, 2011 at 6:36 PM, Lee Parker wrote:

We do regularly sort the results using a field we call date which contains
a unix timestamp as an integer. we are already experiencing slow results
from a filtered and sorted query. If we currently have about 60G of data,
which we do filtered and sorted queries against regularly, should we plan to
use a greater number of shards?

Lee

"It doesn't matter whether you are liberal or conservative, but it's
dangerous to always think with exclamation points instead of question
marks."
by Marty Beckerman
On Thu, Jan 20, 2011 at 10:18 AM, Shay Banon <shay.banon@elasticsearch.com

wrote:

It really depends what you plan to do with it. They can certainly be
bigger than 10gb, if you use things like facets or sorting, the values of
the field per doc will be loaded to memory, which means that you might need
more memory as the shard doc count grows, but that is linear growth, that
can be easily derived from some capacity planning down on a smaller size.

-shay.banon

On Thursday, January 20, 2011 at 4:01 PM, Lee Parker wrote:

It was my understanding based on some other threads that performance will
degrade once shards are greater than 10g. If that is not true then I will
just stick with the 5 shards and 1 replica setup.

Lee

"It doesn't matter whether you are liberal or conservative, but it's
dangerous to always think with exclamation points instead of question
marks."
by Marty Beckerman
On Wed, Jan 19, 2011 at 1:22 PM, Shay Banon shay.banon@elasticsearch.comwrote:

The 5 shards default allow you to grow up to 5 nodes in terms of index
capacity, and in terms of improving search capacity, you can simply
dynamically add more replicas to the cluster. So, with a 5 shard 1 replica
option, you will max out on 10 nodes (nothing will be left to be allocated
to the 11th node if started).

Sounds like you are planning on growing from 1-2 nodes to 3, this is still
well below the capacity provided by 5 shards and 1 replica. A 12gb index is
not a big one.

-shay.banon

On Wednesday, January 19, 2011 at 7:39 PM, Clinton Gormley wrote:

Hi Lee

In that case, here is my plan for increasing my shard count:

  1. Spin up a new ES server but make sure it doesn't join the existing
    server as a node.
  2. Pull the list of documents from my data and then get the documents
    from the existing ES server and put them on the new one.
  3. Once all the data is in the new ES server, shutdown the older one
    wipe its data and then start it up as a new node to join the new ES
    server.

Alternatively, instead of messing with a new server, you could:

  • create a new index 'new_index_timestamp' on the same ES server
  • index from 'my_index' to 'new_index_timestamp'
  • delete 'my_index'
  • create alias 'my_index' pointing to 'new_index_timestamp'

The benefit of switching to using aliases is that it will make it easier
to make changes to your index in the future.

clint


(Karussell) #10

is creating a new index an option?

querying several indices at once is easy ...

On 19 Jan., 18:29, Lee Parker l...@klparker.com wrote:

We have been using elastic search within our production environment for a
few months now. I have discovered that I didn't properly plan for the
amount of data in the index and find myself in need of increasing the number
of shards to improve performance. I am also going to increase the number of
nodes. I know that you can't just increase the number of shards without
reindexing the data. The difficulty in reindexing the data for me is that
our data currently resides in a highly normalized system and de-normalizing
each item into the document that ES indexes can take a long time (2.5 weeks
last time).

In that case, here is my plan for increasing my shard count:

  1. Spin up a new ES server but make sure it doesn't join the existing server
    as a node.
  2. Pull the list of documents from my data and then get the documents from
    the existing ES server and put them on the new one.
  3. Once all the data is in the new ES server, shutdown the older one wipe
    its data and then start it up as a new node to join the new ES server.

Is there a better way to keep things going and increase the shard count? We
are currently using the 5 shard default and all of the shards are around
12G. I am thinking about setting it to 15 shards and 1 (possibly 2)
replicas. Does this make sense? We are going to have a 3 node cluster by
the end of the month and will likely increase that as we see fit throughout
the year.

Lee

"It doesn't matter whether you are liberal or conservative, but it's
dangerous to always think with exclamation points instead of question
marks."
by Marty Beckerman


(Karussell) #11

ups, didn't saw shay's answers. so Lee: forget about my :wink:

On 20 Jan., 18:40, Karussell tableyourt...@googlemail.com wrote:

is creating a new index an option?

querying several indices at once is easy ...

On 19 Jan., 18:29, Lee Parker l...@klparker.com wrote:

We have been using elastic search within our production environment for a
few months now. I have discovered that I didn't properly plan for the
amount of data in the index and find myself in need of increasing the number
of shards to improve performance. I am also going to increase the number of
nodes. I know that you can't just increase the number of shards without
reindexing the data. The difficulty in reindexing the data for me is that
our data currently resides in a highly normalized system and de-normalizing
each item into the document that ES indexes can take a long time (2.5 weeks
last time).

In that case, here is my plan for increasing my shard count:

  1. Spin up a new ES server but make sure it doesn't join the existing server
    as a node.
  2. Pull the list of documents from my data and then get the documents from
    the existing ES server and put them on the new one.
  3. Once all the data is in the new ES server, shutdown the older one wipe
    its data and then start it up as a new node to join the new ES server.

Is there a better way to keep things going and increase the shard count? We
are currently using the 5 shard default and all of the shards are around
12G. I am thinking about setting it to 15 shards and 1 (possibly 2)
replicas. Does this make sense? We are going to have a 3 node cluster by
the end of the month and will likely increase that as we see fit throughout
the year.

Lee

"It doesn't matter whether you are liberal or conservative, but it's
dangerous to always think with exclamation points instead of question
marks."
by Marty Beckerman


(Clinton Gormley) #12

On Thu, 2011-01-20 at 11:26 -0600, Lee Parker wrote:

The current node has 1.2g available to the heap and shows that it is
using 1g of that. I will be building out to at least three nodes in
the next few weeks. Is 1.2g of heap enough when we have shards of
this size or larger?

I would say no. In fact, I'm really surprised you haven't had out of
memory issues with that amount of data and so little memory.

We have 14GB of data, divided into 5 shards, one replica, and we have
two nodes, each with 14GB of heap assigned to elasticsearch.

The more data that can live in memory, the faster ES will perform.

clint


(system) #13