High CPU Load on only some of the machines in a cluster

Hi all

I have not been able to find anything to explain why this may be occurring.

We have a cluster of 7 nodes on ephemeral EC2 with an S3 gateway set to 10
minutes. When we put the cluster under load, though, only 3 of the 7 nodes
get any load on them, and that quickly builds up to the point where queries
start timing out.

We originally had a replication factor of 3 set and have just updated it
to 6 (shard count is 20 per index); however, this does not seem to have done
the trick.

Any ideas?

Do you have multiple indices in the cluster? If so, maybe you end up with certain indices that are allocated only on a subset of the cluster?

Hey Shay

Yes, we do; we have 3, each with 20 shards. Looking with elasticsearch-head
I can see that each index is not spread evenly. Trying
index.routing.allocation.total_shards_per_node (set to 15) transiently
now...

Anything else I should try?
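[Editor's note: for reference, a minimal sketch of how an index-level setting like this is normally applied through the update-settings API. The index name is a placeholder, and, as Shay points out further down the thread, the setting is only honoured from 0.19 onwards.]

    # apply the per-node shard cap to one index (illustrative index name)
    curl -XPUT 'http://localhost:9200/my_index/_settings' -d '{
      "index.routing.allocation.total_shards_per_node": 15
    }'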

UPDATE: Unfortunately, applying this transiently balked; however, I have
applied it manually to each node in the cluster. After the dust settled,
the shards still aren't distributed fairly.
We have set the setting to 13 (20 shards * 4 (original + copies) / 7 nodes
≈ 12, so this should be fine), and some nodes still have 20 shards of a
single index.

We are using 0.18.6; is this a 0.18.7-only setting, by any chance?
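[Editor's note: assuming the manual per-node application was done the usual way, the line in each node's elasticsearch.yml would look roughly like this; a sketch only, and ignored by versions before 0.19.]

    # elasticsearch.yml on each node (illustrative)
    index.routing.allocation.total_shards_per_node: 13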

The thing to do is to run a jstack on the elasticsearch process (use
pastebin and post the link here). Do it during load; we will be able to
tell which of these is happening. So far I have noted at least a few major
overload conditions:

1) The index is cold and you are doing bulk inserts, so bloom filter loading
may kill your disk IO (Shay, is there a way to pace this somehow? The IOwait
and waiting threads go to 500-1000).
2) You are doing massive merges or flushes; potentially your refresh is not
set to -1 while you are indexing.
3) Your disks are too slow; check an iostat -xdm 1 report for 30 seconds.
4) You are running out of heap memory (check the log for OOME messages).

-Jack
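[Editor's note: a rough sketch of the capture Jack is asking for, assuming a standard JDK and sysstat install; the grep pattern for the Elasticsearch process is an assumption.]

    # find the Elasticsearch JVM and dump its threads while the cluster is under load
    ES_PID=$(jps -l | grep -i elasticsearch | awk '{print $1}')
    jstack $ES_PID > /tmp/es-threads.txt   # paste this file to pastebin

    # extended per-device disk stats in MB, 1-second samples for 30 seconds
    iostat -xdm 1 30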

Thanks for your offer to help, Jack.

http://pastebin.com/ZFHJzX9Y

That is the jstack trace you asked for. We have 4 nodes now on
m1.xlarge with an mdadm RAID 0 across the 4 drives.

We are doing indexing whilst we run the load, as well as straight
queries. We haven't turned off the refresh but have turned it down
to 10 seconds:

gateway.index_refresh: 10

I have also skitched a copy of the cluster allocation tables in
elasticsearch-head

Any help you can give would be much appreciated

Cheers
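[Editor's note: for what it's worth, the refresh knob Jack refers to is normally the per-index refresh_interval; a hedged sketch of disabling it during heavy indexing follows. The index name is a placeholder, and whether the gateway.index_refresh key above maps to the same thing is not confirmed here.]

    # turn off automatic refresh while bulk indexing, then restore it afterwards
    curl -XPUT 'http://localhost:9200/my_index/_settings' -d '{"index": {"refresh_interval": "-1"}}'
    # ... run the bulk indexing ...
    curl -XPUT 'http://localhost:9200/my_index/_settings' -d '{"index": {"refresh_interval": "10s"}}'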

The setting only applies in 0.19. Also, you did not answer the question with a bit more detailed information: how many indices, how many shards, and are they not evenly balanced?

Sorry, Shay.

The image attached via Skitch shows the cluster and shard distribution at
the time of the jstack trace.

We have 4 nodes, with 3 indexes and a replication factor of 3. Each
index has 20 shards. These are running across two availability zones
on Amazon.

When we turn down the replication factor, the nodes get an even number
of shards overall; however, they do not get an even number of shards per
index. This means that indexes with higher load do not utilise the entire
cluster's resources.

We are currently using 0.18.7.

This new Skitch illustrates what I was saying about the index-to-node
balancing behavior.

Chris,

ES does not distribute shards to achieve an even number of shards per
index; it distributes to have an even number of shards per node in total.
Currently in ES, when you have indices with drastically different sizes,
you can end up with unbalanced load.
Why do you have a separate index with 20 shards for the index with 106
documents? Is it supposed to grow? The easiest way to resolve the problem
is to have only a single index with different types for the 2nd and 3rd
indices, rather than separate indices.

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype
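[Editor's note: a minimal sketch of what Berkay is suggesting, with one index and separate types instead of separate indices; the index, type, and field names are made up for illustration, and the shard/replica counts just mirror the ones discussed in this thread.]

    # one combined index instead of several small ones
    curl -XPUT 'http://localhost:9200/combined' -d '{
      "settings": { "number_of_shards": 20, "number_of_replicas": 3 }
    }'

    # documents that previously went to the 2nd and 3rd indices become types in the same index
    curl -XPUT 'http://localhost:9200/combined/type_two/1' -d '{"field": "value"}'
    curl -XPUT 'http://localhost:9200/combined/type_three/1' -d '{"field": "value"}'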

As Berkay mentioned, the balancing is done across indices, though I have started to work on a smarter allocator that will handle the cross-indices case better. In 0.19, I added a band-aid (for now) where you can force, at the index level, a maximum number of shards per node, so you can force even balancing (take total_number_of_shards / number_of_nodes, where total_number_of_shards is (1 + number_of_replicas) * number_of_shards).
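[Editor's note: worked through with the numbers from earlier in this thread (20 shards and 3 replicas per index, 7 nodes), purely as an illustration.]

    total_number_of_shards = (1 + number_of_replicas) * number_of_shards
                           = (1 + 3) * 20 = 80
    max shards per node    = 80 / 7 ≈ 11.4, rounded up to 12 (13 was used above for headroom)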

Hi,

I am having a similar experience. In my case, 1 node is pegged while
others have only 25-50% CPU utilization. That over-burdened node is
hosting 4 index shards while the others have 3; it is not the master.
None of the nodes are swapping.

Configuration.

Running v0.18.7. I have only a single index: 5 shards, 1 replica. ES
Head reports it as ~1mil documents and ~1.8g in size. The index is
configured to be entirely in memory with the S3 gateway. ES_MAX_MEM is
at 4g. I'm running it on a 3-node cluster of m1.large instances.

Workload.

I started with a bulk-insert of all the data. After that, a couple of
schema tweaks resulted in additional loads of a portion of the data
set. The querying being run is "match_all" with a "geo_distance"
filter applied. At this point no other fields are being filtered. No
writes are happening during this read test.

Why is this single machine the bottleneck? How can I go about
distributing this load more evenly? Is there a minimum cluster size or
specific cluster configuration I should be using for this kind of
load? I'm curious how I can even go about diagnosing the issue.

Thanks,
Nick
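[Editor's note: for context, a query of the shape Nick describes would look roughly like this in the 0.18-era query DSL; the index name, field name, distance, and coordinates are placeholders.]

    curl -XGET 'http://localhost:9200/my_index/_search' -d '{
      "query": {
        "filtered": {
          "query": { "match_all": {} },
          "filter": {
            "geo_distance": {
              "distance": "10km",
              "location": { "lat": 40.73, "lon": -73.99 }
            }
          }
        }
      }
    }'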

Thanks for that, Shay.

(I was wondering why that setting wasn't taking effect on 0.18.7 :)).
That setting will help distribute the shards in the multi-index case.
For now we have simply combined our indexes; we are now down to 2, and
the shards are more evenly distributed due to that change.

We ended up getting the cluster to cope with the volume of reads by
switching on caching for certain filters. So far this is working
for us... although before the cache warms we still get the problem
that Nick mentions above, where one or two machines get high load and
the rest don't show any. This disappears as the cache gets populated.
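[Editor's note: a hedged sketch of the kind of filter caching described here, using the _cache flag that most filters accepted in this era; whether it applies to the exact filters used in this cluster is an assumption, and all names are placeholders.]

    curl -XGET 'http://localhost:9200/my_index/_search' -d '{
      "query": {
        "filtered": {
          "query": { "match_all": {} },
          "filter": {
            "terms": {
              "category": ["books", "music"],
              "_cache": true
            }
          }
        }
      }
    }'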

Hi there,

For what it's worth, I have the same issue here. We upgraded our cluster to
0.18.7 (6 servers, 1 index, 20 shards, 1 replica) two days ago and set the
max and min heap to 4GB. Afterwards we started to index 50G of docs at about
150 docs/sec, and one of the servers showed really high load (between 20
and 40) with CPU usage at 200% on a 4-core CPU. After restarting that
server, another server showed similar behavior.

Thanks,
Frederic
