On scaling


(BARNEY) #1

Let us assume we create index with 2 shards and 1 replica on two nodes
in a elasticsearch box1.After indexing some documents,disk space(say
1000GB) in the box1 will get filled.
To index more documents, I will have 2 more nodes(one more box say
box2 has disk space 1000GB).
What is the mechanism to use total 4 nodes(two boxes) with only one
index having memory 2000GB?

What I am thinking is like If we create 2 more primary shards to the
box1 dynamically,then created shards will get allocated on the second
box.To do this I want to know how to create shards dynamically?

Apart from creating dynamic shards,any other mechanism to scale?


(David Pilato) #2

You can't modify shards after index creation.

HTH
David :wink:
@dadoonet

Le 2 déc. 2011 à 07:43, BARNEY kalyanc007@gmail.com a écrit :

Let us assume we create index with 2 shards and 1 replica on two nodes
in a elasticsearch box1.After indexing some documents,disk space(say
1000GB) in the box1 will get filled.
To index more documents, I will have 2 more nodes(one more box say
box2 has disk space 1000GB).
What is the mechanism to use total 4 nodes(two boxes) with only one
index having memory 2000GB?

What I am thinking is like If we create 2 more primary shards to the
box1 dynamically,then created shards will get allocated on the second
box.To do this I want to know how to create shards dynamically?

Apart from creating dynamic shards,any other mechanism to scale?


(Drew Raines) #3

BARNEY wrote:

Let us assume we create index with 2 shards and 1 replica on two
nodes in a elasticsearch box1.After indexing some documents,disk
space(say 1000GB) in the box1 will get filled. To index more
documents, I will have 2 more nodes(one more box say box2 has disk
space 1000GB). What is the mechanism to use total 4 nodes(two
boxes) with only one index having memory 2000GB?

What I am thinking is like If we create 2 more primary shards to
the box1 dynamically,then created shards will get allocated on the
second box.To do this I want to know how to create shards
dynamically?

Apart from creating dynamic shards,any other mechanism to scale?

As David mentioned, shards are not configurable after index
creation. You have a few options, and you may want to do all of them
in time.

The first is to create your index with more shards. This will give
you more horizontal node growth. You can't store a 10TiB index on
100 1TiB nodes if you only have 2 shards, because each shard would be
5TiB. So, calculate a reasonable number given the size of the
resources you have available.

The next tool at your disposal when you've exhausted the previous
step's capacity are aliases. You'll want to roll over into a new
index and create an alias that points to the old and new index. You
don't even have to do this step if your older data doesn't need to be
queried seamlessly. You could either disable it or leave it to be
queried manually, but an alias allows you to query both indices with
no client knowledge that they're separate. You can also set up alias
filters to enforce logical partitions of data.

At some point you'll exhaust the limits of that cluster. The only
option then will be to create another cluster and either migrate data
to it or use both clusters together. We do the latter through load
balancers with great success, managing 20 clusters of 1PiB of data.

-Drew


(Michael Sick) #4

Drew,

1PB! That's the biggest ES cluster I've seen referenced to date. Maybe you
could post some experiences/lessons learned to the "Please, tell about the
success story about ES usage on production" thread?

--Mike

On Fri, Dec 2, 2011 at 11:32 AM, Drew Raines aaraines@gmail.com wrote:

BARNEY wrote:

Let us assume we create index with 2 shards and 1 replica on two
nodes in a elasticsearch box1.After indexing some documents,disk
space(say 1000GB) in the box1 will get filled. To index more
documents, I will have 2 more nodes(one more box say box2 has disk
space 1000GB). What is the mechanism to use total 4 nodes(two
boxes) with only one index having memory 2000GB?

What I am thinking is like If we create 2 more primary shards to
the box1 dynamically,then created shards will get allocated on the
second box.To do this I want to know how to create shards
dynamically?

Apart from creating dynamic shards,any other mechanism to scale?

As David mentioned, shards are not configurable after index
creation. You have a few options, and you may want to do all of them
in time.

The first is to create your index with more shards. This will give
you more horizontal node growth. You can't store a 10TiB index on
100 1TiB nodes if you only have 2 shards, because each shard would be
5TiB. So, calculate a reasonable number given the size of the
resources you have available.

The next tool at your disposal when you've exhausted the previous
step's capacity are aliases. You'll want to roll over into a new
index and create an alias that points to the old and new index. You
don't even have to do this step if your older data doesn't need to be
queried seamlessly. You could either disable it or leave it to be
queried manually, but an alias allows you to query both indices with
no client knowledge that they're separate. You can also set up alias
filters to enforce logical partitions of data.

At some point you'll exhaust the limits of that cluster. The only
option then will be to create another cluster and either migrate data
to it or use both clusters together. We do the latter through load
balancers with great success, managing 20 clusters of 1PiB of data.

-Drew


(BARNEY) #5

Thanks Drew.
I have one more option.I will create one more index with 4 shards and
1 replica in the same cluster.Then I migrate old one(2shards
1replica) to newly created index(4shards 1replica).
which one is more practical and give good performance?

On Dec 2, 9:32 pm, Drew Raines aarai...@gmail.com wrote:

BARNEYwrote:

Let us assume we create index with 2 shards and 1 replica on two
nodes in a elasticsearch box1.After indexing some documents,disk
space(say 1000GB) in the box1 will get filled. To index more
documents, I will have 2 more nodes(one more box say box2 has disk
space 1000GB). What is the mechanism to use total 4 nodes(two
boxes) with only one index having memory 2000GB?

What I am thinking is like If we create 2 more primary shards to
the box1 dynamically,then created shards will get allocated on the
second box.To do this I want to know how to create shards
dynamically?

Apart from creating dynamic shards,any other mechanism to scale?

As David mentioned, shards are not configurable after index
creation. You have a few options, and you may want to do all of them
in time.

The first is to create your index with more shards. This will give
you more horizontal node growth. You can't store a 10TiB index on
100 1TiB nodes if you only have 2 shards, because each shard would be
5TiB. So, calculate a reasonable number given the size of the
resources you have available.

The next tool at your disposal when you've exhausted the previous
step's capacity are aliases. You'll want to roll over into a new
index and create an alias that points to the old and new index. You
don't even have to do this step if your older data doesn't need to be
queried seamlessly. You could either disable it or leave it to be
queried manually, but an alias allows you to query both indices with
no client knowledge that they're separate. You can also set up alias
filters to enforce logical partitions of data.

At some point you'll exhaust the limits of that cluster. The only
option then will be to create another cluster and either migrate data
to it or use both clusters together. We do the latter through load
balancers with great success, managing 20 clusters of 1PiB of data.

-Drew


(Shay Banon) #6

Which one compared to what? The simplest answer is over-allocate shards,
for a cluster with 4 nodes, create an index with 6-10 shards with 1
replica, this will allow you to grow to 6-10 nodes index size wise, and
6-10 * (number_of_replicas + 1) nodes. (just for that index, if you have
more indices, obviously you can add more nodes before you get to a "limit"
of 1 shard per node).

On Mon, Dec 5, 2011 at 8:08 AM, BARNEY kalyanc007@gmail.com wrote:

Thanks Drew.
I have one more option.I will create one more index with 4 shards and
1 replica in the same cluster.Then I migrate old one(2shards
1replica) to newly created index(4shards 1replica).
which one is more practical and give good performance?

On Dec 2, 9:32 pm, Drew Raines aarai...@gmail.com wrote:

BARNEYwrote:

Let us assume we create index with 2 shards and 1 replica on two
nodes in a elasticsearch box1.After indexing some documents,disk
space(say 1000GB) in the box1 will get filled. To index more
documents, I will have 2 more nodes(one more box say box2 has disk
space 1000GB). What is the mechanism to use total 4 nodes(two
boxes) with only one index having memory 2000GB?

What I am thinking is like If we create 2 more primary shards to
the box1 dynamically,then created shards will get allocated on the
second box.To do this I want to know how to create shards
dynamically?

Apart from creating dynamic shards,any other mechanism to scale?

As David mentioned, shards are not configurable after index
creation. You have a few options, and you may want to do all of them
in time.

The first is to create your index with more shards. This will give
you more horizontal node growth. You can't store a 10TiB index on
100 1TiB nodes if you only have 2 shards, because each shard would be
5TiB. So, calculate a reasonable number given the size of the
resources you have available.

The next tool at your disposal when you've exhausted the previous
step's capacity are aliases. You'll want to roll over into a new
index and create an alias that points to the old and new index. You
don't even have to do this step if your older data doesn't need to be
queried seamlessly. You could either disable it or leave it to be
queried manually, but an alias allows you to query both indices with
no client knowledge that they're separate. You can also set up alias
filters to enforce logical partitions of data.

At some point you'll exhaust the limits of that cluster. The only
option then will be to create another cluster and either migrate data
to it or use both clusters together. We do the latter through load
balancers with great success, managing 20 clusters of 1PiB of data.

-Drew


(Michael Sick) #7

Shay,

Any advice on what the min/max # of shards/node would be?

On Mon, Dec 5, 2011 at 2:25 PM, Shay Banon kimchy@gmail.com wrote:

Which one compared to what? The simplest answer is over-allocate shards,
for a cluster with 4 nodes, create an index with 6-10 shards with 1
replica, this will allow you to grow to 6-10 nodes index size wise, and
6-10 * (number_of_replicas + 1) nodes. (just for that index, if you have
more indices, obviously you can add more nodes before you get to a "limit"
of 1 shard per node).

On Mon, Dec 5, 2011 at 8:08 AM, BARNEY kalyanc007@gmail.com wrote:

Thanks Drew.
I have one more option.I will create one more index with 4 shards and
1 replica in the same cluster.Then I migrate old one(2shards
1replica) to newly created index(4shards 1replica).
which one is more practical and give good performance?

On Dec 2, 9:32 pm, Drew Raines aarai...@gmail.com wrote:

BARNEYwrote:

Let us assume we create index with 2 shards and 1 replica on two
nodes in a elasticsearch box1.After indexing some documents,disk
space(say 1000GB) in the box1 will get filled. To index more
documents, I will have 2 more nodes(one more box say box2 has disk
space 1000GB). What is the mechanism to use total 4 nodes(two
boxes) with only one index having memory 2000GB?

What I am thinking is like If we create 2 more primary shards to
the box1 dynamically,then created shards will get allocated on the
second box.To do this I want to know how to create shards
dynamically?

Apart from creating dynamic shards,any other mechanism to scale?

As David mentioned, shards are not configurable after index
creation. You have a few options, and you may want to do all of them
in time.

The first is to create your index with more shards. This will give
you more horizontal node growth. You can't store a 10TiB index on
100 1TiB nodes if you only have 2 shards, because each shard would be
5TiB. So, calculate a reasonable number given the size of the
resources you have available.

The next tool at your disposal when you've exhausted the previous
step's capacity are aliases. You'll want to roll over into a new
index and create an alias that points to the old and new index. You
don't even have to do this step if your older data doesn't need to be
queried seamlessly. You could either disable it or leave it to be
queried manually, but an alias allows you to query both indices with
no client knowledge that they're separate. You can also set up alias
filters to enforce logical partitions of data.

At some point you'll exhaust the limits of that cluster. The only
option then will be to create another cluster and either migrate data
to it or use both clusters together. We do the latter through load
balancers with great success, managing 20 clusters of 1PiB of data.

-Drew


(Shay Banon) #8

Hard to say, really depends on the environment and data you have. Note
though, if you have a single index with 100 shards (for example), then it
means a distributed search across 100 shards (unless you use routing).

On Mon, Dec 5, 2011 at 9:46 PM, Michael Sick <
michael.sick@serenesoftware.com> wrote:

Shay,

Any advice on what the min/max # of shards/node would be?

On Mon, Dec 5, 2011 at 2:25 PM, Shay Banon kimchy@gmail.com wrote:

Which one compared to what? The simplest answer is over-allocate shards,
for a cluster with 4 nodes, create an index with 6-10 shards with 1
replica, this will allow you to grow to 6-10 nodes index size wise, and
6-10 * (number_of_replicas + 1) nodes. (just for that index, if you have
more indices, obviously you can add more nodes before you get to a "limit"
of 1 shard per node).

On Mon, Dec 5, 2011 at 8:08 AM, BARNEY kalyanc007@gmail.com wrote:

Thanks Drew.
I have one more option.I will create one more index with 4 shards and
1 replica in the same cluster.Then I migrate old one(2shards
1replica) to newly created index(4shards 1replica).
which one is more practical and give good performance?

On Dec 2, 9:32 pm, Drew Raines aarai...@gmail.com wrote:

BARNEYwrote:

Let us assume we create index with 2 shards and 1 replica on two
nodes in a elasticsearch box1.After indexing some documents,disk
space(say 1000GB) in the box1 will get filled. To index more
documents, I will have 2 more nodes(one more box say box2 has disk
space 1000GB). What is the mechanism to use total 4 nodes(two
boxes) with only one index having memory 2000GB?

What I am thinking is like If we create 2 more primary shards to
the box1 dynamically,then created shards will get allocated on the
second box.To do this I want to know how to create shards
dynamically?

Apart from creating dynamic shards,any other mechanism to scale?

As David mentioned, shards are not configurable after index
creation. You have a few options, and you may want to do all of them
in time.

The first is to create your index with more shards. This will give
you more horizontal node growth. You can't store a 10TiB index on
100 1TiB nodes if you only have 2 shards, because each shard would be
5TiB. So, calculate a reasonable number given the size of the
resources you have available.

The next tool at your disposal when you've exhausted the previous
step's capacity are aliases. You'll want to roll over into a new
index and create an alias that points to the old and new index. You
don't even have to do this step if your older data doesn't need to be
queried seamlessly. You could either disable it or leave it to be
queried manually, but an alias allows you to query both indices with
no client knowledge that they're separate. You can also set up alias
filters to enforce logical partitions of data.

At some point you'll exhaust the limits of that cluster. The only
option then will be to create another cluster and either migrate data
to it or use both clusters together. We do the latter through load
balancers with great success, managing 20 clusters of 1PiB of data.

-Drew


(Drew Raines) #9

Michael Sick wrote:

Any advice on what the min/max # of shards/node would be?

Depends on how much searching you're doing. On a non-busy cluster
we've gotten beyond 250 shards/node on AWS m1.xlarge (15GiB RAM).
Avg size of those shards is 100-200GiB.

-Drew


(Drew Raines) #10

Michael Sick wrote:

1PB! That's the biggest ES cluster I've seen referenced to
date. Maybe you could post some experiences/lessons learned to the
"Please, tell about the success story about ES usage on production"
thread?

I haven't had the time to properly contribute to that thread, but
just to clarify: we don't have a PB in a single cluster. It's made
up of about 15 clusters ranging from 10-150TB.

I don't have any doubt that ES can scale to a PB though. It's
extremely horizontal given enough shards. We just choose not to
architect that way. You run into non-ES issues at that scale (and
well before it). :slight_smile:

-Drew


(system) #11