Elasticsearch - shards not split equally

Hello,
I'm running an RPM based on-prem cluster with 21 data nodes:

As shown, one node specifically has a very high load, while the Logstash instances complain about bulk retries due to the many writes going to this node:

Correspondingly, this node shows excessive GC activity (the data nodes have 30GB heaps and use the CMS GC):

So I wrote a short script to see how the active shards (today's shards, the ones currently receiving writes) are spread across the cluster, and for some reason us-elkdb15 holds many more of them than the other nodes:
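For reference, a minimal sketch of such a script (this is not the actual test.sh; the daily yyyy.MM.dd index suffix, the node naming scheme, and the localhost:9200 endpoint are assumptions):

#!/bin/bash
# Count how many of today's shards (primaries and replicas) sit on each data node.
today=$(date +%Y.%m.%d)
for i in $(seq 1 21); do
  node="us-elkdb$i"
  echo "today shards on node $node:"
  curl -s "localhost:9200/_cat/shards?h=index,node" \
    | awk -v d="$today" -v n="$node" '$1 ~ d && $2 == n' \
    | wc -l
done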

Any explanation for this behavior? The high load on this node causes significant performance issues for the whole cluster.

Thanks,
Lior

Hi @Lior_Yakobov,

Please don't post images of text, they're impossible to search and quote and difficult to read for those of us with screenreaders.

I think it's likely that too many of today's shards ended up on one node because that node had the fewest shards at the time those indices were being created. There's a long-standing issue about this very situation.

For now, the usual fix is to set total_shards_per_node on each index to prevent too many of its shards from ending up on the same node. As the docs say, this setting can result in shards not being allocated if you set it too tightly or the allocator gets backed into a corner.
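As a sketch, for daily indices you would typically put it in an index template so each new index picks it up (the template name, index pattern, and the limit of 2 below are placeholders, not recommendations):

PUT _template/daily-logs-example
{
  "index_patterns": ["*-logs-*"],
  "settings": {
    "index.routing.allocation.total_shards_per_node": 2
  }
}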

Hey @DavidTurner,
Thanks for the quick response, but I'm afraid total_shards_per_node doesn't match my use case: the total number of shards is already shared equally between the cluster's data nodes. The problem was that too many active shards (the ones currently being indexed into) were on one specific node while the rest of the nodes had far fewer.
Is there any configuration for this use case? For example, something like a max_active_shards_per_node setting (just a suggestion) would be useful.

This scenario happens to us frequently, and the entire work day becomes very hard because the cluster suffers from serious performance issues (the overloaded node hangs during API calls, so Kibana effectively goes down as well).
Today, for example, the shards are spread equally and performance is much better:

[root@us-elkdb15 tmp]# source test.sh
today shards on node us-elkdb1:
6
today shards on node us-elkdb2:
7
today shards on node us-elkdb3:
4
today shards on node us-elkdb4:
5
today shards on node us-elkdb5:
7
today shards on node us-elkdb6:
6
today shards on node us-elkdb7:
4
today shards on node us-elkdb8:
6
today shards on node us-elkdb9:
6
today shards on node us-elkdb10:
4
today shards on node us-elkdb11:
4
today shards on node us-elkdb12:
6
today shards on node us-elkdb13:
5
today shards on node us-elkdb14:
6
today shards on node us-elkdb15:
5
today shards on node us-elkdb16:
5
today shards on node us-elkdb17:
6
today shards on node us-elkdb18:
5
today shards on node us-elkdb19:
5
today shards on node us-elkdb20:
4
today shards on node us-elkdb21:
4

[root@us-elkdb15 tmp]# curl "localhost:9200/_cat/nodes?v&h=name,ip,heap.percent,ram.percent,cpu,load_1m,load_5m,load_15m,node_role,master,version,jdk&s=name"
name ip heap.percent ram.percent cpu load_1m load_5m load_15m master version jdk
master1 209.87.212.78 6 64 0 0.00 0.01 0.05 - 6.6.0 11.0.2
master2 209.87.212.79 5 92 0 0.00 0.01 0.05 - 6.6.0 11.0.2
master3 209.87.212.80 65 96 5 0.38 0.59 0.64 * 6.6.0 11.0.2
us-elkdb1 209.87.212.27 67 100 24 4.68 4.73 4.74 - 6.6.0 11.0.2
us-elkdb10 209.87.212.48 67 100 14 7.76 6.40 5.37 - 6.6.0 11.0.2
us-elkdb11 209.87.212.107 69 99 23 4.45 4.18 3.90 - 6.6.0 11.0.2
us-elkdb12 209.87.212.108 62 99 44 5.80 5.19 5.47 - 6.6.0 11.0.2
us-elkdb13 209.87.212.109 50 99 36 5.38 3.63 3.51 - 6.6.0 11.0.2
us-elkdb14 209.87.212.110 62 99 24 3.69 3.52 3.52 - 6.6.0 11.0.2
us-elkdb15 209.87.212.119 75 100 32 6.02 6.73 6.74 - 6.6.0 11.0.2
us-elkdb16 209.87.212.120 68 100 13 3.63 3.67 3.67 - 6.6.0 11.0.2
us-elkdb17 209.87.212.121 70 100 16 7.56 6.52 6.24 - 6.6.0 11.0.2
us-elkdb18 209.87.212.122 66 99 18 4.74 5.06 5.06 - 6.6.0 11.0.2
us-elkdb19 209.87.212.123 69 100 22 6.59 7.06 7.24 - 6.6.0 11.0.2
us-elkdb2 209.87.212.28 71 100 32 6.03 6.39 6.20 - 6.6.0 11.0.2
us-elkdb20 209.87.212.124 74 100 18 6.14 6.33 5.86 - 6.6.0 11.0.2
us-elkdb21 209.87.212.125 63 99 19 5.38 4.61 4.46 - 6.6.0 11.0.2
us-elkdb3 209.87.212.29 65 100 17 3.39 3.91 4.21 - 6.6.0 11.0.2
us-elkdb4 209.87.212.39 65 100 10 6.14 4.93 4.11 - 6.6.0 11.0.2
us-elkdb5 209.87.212.40 66 99 19 4.17 6.26 6.30 - 6.6.0 11.0.2
us-elkdb6 209.87.212.41 73 99 15 4.98 5.47 5.42 - 6.6.0 11.0.2
us-elkdb7 209.87.212.42 73 99 7 1.74 2.02 2.37 - 6.6.0 11.0.2
us-elkdb8 209.87.212.43 69 99 19 7.09 7.50 7.36 - 6.6.0 11.0.2
us-elkdb9 209.87.212.47 67 100 11 7.73 6.47 5.22 - 6.6.0 11.0.2
us-elkhq1-kibana1 209.87.212.33 47 82 20 1.39 1.22 1.26 - 6.6.0 11.0.2
us-elkhq2-kibana2 209.87.212.207 70 97 21 1.52 1.72 1.83 - 6.6.0 11.0.2
us-elkhq3-kibana3 209.87.212.216 6 86 19 1.41 1.13 1.48 - 6.6.0 11.0.2

The question is: how can I prevent this behavior from happening again in the future?

Many thanks,
Lior

Hi Lior,

I think you misunderstood the setting David is talking about.
He means the total_shards_per_node index setting, not the cluster setting.

The cluster setting limits the maximum shard count per node irrespective of which index each shard belongs to.

The index setting, the one he was talking about, limits how many shards from the same index can sit on one node. It can be used to prevent too many shards from the same index (today's index, for instance) from ending up on the same node.

To discuss this further, let's first clarify what you mean by an active shard. I'll give a definition, and you can give yours if it differs.
An active shard is a shard of an index that is currently being indexed into. All shards of such an index are equally active. With daily indices, all shards of all of today's indices are active. This is obviously indexing-centric, because I think that's what you're talking about.

Now, because you didn't share a proper output of
GET /_cat/indices?v&s=index
it's hard to give specific advice and to tell whether David's suggestion will really prevent the "unequally hot nodes" problem in your setup. Please share this info.

In the meantime I'll share how I alleviate this in my cluster; hopefully it helps, and we can discuss how/if you can do something similar after you've shared your "cat indices".

I have 20 data nodes, all on the same hardware. I want the indexing workload to be spread very equally, as this is a log/metric cluster. If the shards being indexed into are not spread VERY equally across my 20 nodes, some (or one) of the nodes become busier than the others, and that doesn't work well because my cluster has periods during the day where it runs pretty hot across the board. If it gets unbalanced in terms of hot shards, the nodes that work harder than the rest simply have too much to do and can't keep up.

What I do is configure every index that receives (or can sometimes receive) a significant events/sec rate with a shard count that is a multiple of my node count, so 20, 40, 60, etc.
For example, my metricbeat index has 10 primary shards with 1 replica, so 20 shards in total. All my big indices in terms of indexing workload have a minimum of 20 shards.
20/20 = 1: if I can make sure that each node hosts one and only one of these, the workload is spread equally. The idea is that the workload has to be evenly divisible or it simply can't be evenly distributed. I could also have 20 primary shards and 1 replica, so 40/20 = 2, and that would work too.
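As a sketch, that shard count is just something you bake into the index template for those heavy indices (the template name and pattern below are placeholders, and the counts are the ones from my example, not a recommendation):

PUT _template/heavy-daily-indices
{
  "index_patterns": ["metricbeat-*"],
  "settings": {
    "number_of_shards": 10,
    "number_of_replicas": 1
  }
}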

In that context (10 shards, 1 replica = 20 shards), if I were using the setting David suggested, maybe I would set it to:
index.routing.allocation.total_shards_per_node: 1
which means each node can't have more than 1 of those 20 shards. They all have exactly 1, the workload is spread equally, profit.
There would be downsides if I did that, though, so maybe I would have to set it higher. If you set it to 1 in that context and you lose nodes, the setting can prevent the cluster from compensating for the lost nodes: the index could stay yellow or red until the nodes come back, which can be pretty bad depending on how long you can tolerate missing nodes.

For now at least, I elected to use other settings:
cluster.routing.allocation.balance.shard
cluster.routing.allocation.balance.index
cluster.routing.allocation.balance.threshold
https://www.elastic.co/guide/en/elasticsearch/reference/current/shards-allocation.html#_shard_balancing_heuristics
By putting a huge portion of the weight on balance.index, the cluster tries to spread the shards of each index equally across the nodes. So for 20 shards and 20 nodes, it puts 1 on each node.
I'm telling the cluster to prefer balancing the shards of each index equally across the nodes over balancing the total shard count across the nodes irrespective of which index they come from. None of that is specific to the indices or shards currently being indexed into; it does this for all indices, not just today's. But as long as it applies to today's indices, I get what I want. The fact that older indices are also spread as flat as possible across the nodes has no downside that I have identified so far; on the contrary, it also spreads the search load more equally across the nodes.
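These are dynamic cluster settings, so, as a sketch (the values below are only illustrative; the right ones are something you have to tune for your own cluster), they can be changed on a running cluster like this:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.balance.index": "0.85f",
    "cluster.routing.allocation.balance.shard": "0.15f"
  }
}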

The downside I have identified, and that I currently live with, is that my currently active indices are sometimes oversharded or undersharded relative to the usual best practice of "in your use case each shard should be around X GB". Since I always use 20, 40, 60, etc., I can't at the same time produce shards that also hit the specific size they should have.

My future will be a bit brighter, I think. With ILM I plan to abandon strictly time-based rotation in favour of combined time- and size-based rotation, which will mean fewer over/under-sharded indices.
Also, with ILM I will be able to more easily shrink oversharded indices once they are no longer being indexed into. If only the indices currently being indexed into are oversharded, that means far fewer oversharded indices altogether.

The folks from Elastic might recommend against my method here, so stay careful and feel free to wait for more feedback before trying to implement any of it. I certainly don't claim ES guru level.
But the only thing I have so far been able to implement to spread my load and prevent super hot nodes is to use a shard count that is a multiple of my node count for all indexing-heavy indices, plus the settings I mentioned above. If I used something other than a multiple of the node count, hotter nodes would be possible, and so it would happen.

If I had 60 nodes, it would be a lot crazier to give all my big indices a minimum of 60 shards. I would then have to use:
https://www.elastic.co/guide/en/elasticsearch/reference/current/shard-allocation-filtering.html
to put specific indices on specific groups of nodes, and make the index shard count a multiple of the node count of the group that index is going to. I could then control the node count that the shard count needs to be a multiple of.
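A rough sketch of that filtering approach (the attribute name index_group, its value, and the index name are all made up for illustration):

# elasticsearch.yml on each data node of one group:
node.attr.index_group: group-a

# then pin a heavy index (or its template) to that group:
PUT heavy-index-2019.06.04/_settings
{
  "index.routing.allocation.require.index_group": "group-a"
}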

Interesting subject BTW, thanks for bringing it up!

@martinr_ubi, thank you for the extremely detailed answer, much appreciated!!!
From my understanding, having too many shards in the cluster (i.e. indices with an unnecessarily high shard count) can also affect cluster performance (rebalancing and re-allocation take longer).
But I wanted to ask: can you give an example of the index size and document count for which you set 10 primary shards?
In my cluster, the biggest index is currently around 100GB of primary data; I set it to 7 primary shards and 1 replica (14 shards in total), with an estimated document count of 85M.

I am considering using your settings; "telling the cluster to prefer balancing the shards from each index equally across the nodes vs preferring to balance the total shard count across the nodes irrespective of which index they are from" sounds like a match for my use case.

Thanks again,
Lior

Hey @martinr_ubi, so after a couple of days during which the shards were spread more or less equally, look at my cluster status today:

today shards on node us-elkdb1:
3
today shards on node us-elkdb2:
0
today shards on node us-elkdb3:
0
today shards on node us-elkdb4:
0
today shards on node us-elkdb5:
1
today shards on node us-elkdb6:
0
today shards on node us-elkdb7:
2
today shards on node us-elkdb8:
2
today shards on node us-elkdb9:
1
today shards on node us-elkdb10:
1
today shards on node us-elkdb11:
1
today shards on node us-elkdb12:
2
today shards on node us-elkdb13:
0
today shards on node us-elkdb14:
0
today shards on node us-elkdb15:
1
today shards on node us-elkdb16:
0
today shards on node us-elkdb17:
0
today shards on node us-elkdb18:
0
today shards on node us-elkdb19:
3
today shards on node us-elkdb20:
49
today shards on node us-elkdb21:
48

From querying /_cat/shards I saw that this happens because all of the shards of some indices are located on these 2 nodes (copying the relevant parts):

te-apache-logs-2019.06.04 1 p STARTED 11908865 2.8gb 209.87.212.124 us-elkdb20
te-apache-logs-2019.06.04 4 p STARTED 11866367 2.9gb 209.87.212.124 us-elkdb20
te-apache-logs-2019.06.04 2 p STARTED 11838978 2.8gb 209.87.212.124 us-elkdb20
te-apache-logs-2019.06.04 3 p STARTED 11862548 2.8gb 209.87.212.124 us-elkdb20
te-apache-logs-2019.06.04 0 p STARTED 11904096 2.9gb 209.87.212.124 us-elkdb20
aws-reputation-service-2019.06.04 5 p STARTED 3165580 3.5gb 209.87.212.124 us-elkdb20
aws-reputation-service-2019.06.04 1 p STARTED 3131870 3.3gb 209.87.212.124 us-elkdb20
aws-reputation-service-2019.06.04 6 p STARTED 3162319 3.7gb 209.87.212.124 us-elkdb20
aws-reputation-service-2019.06.04 4 p STARTED 3158450 3.4gb 209.87.212.124 us-elkdb20
aws-reputation-service-2019.06.04 2 p STARTED 3146566 3.4gb 209.87.212.124 us-elkdb20
aws-reputation-service-2019.06.04 3 p STARTED 3154233 4.5gb 209.87.212.124 us-elkdb20
aws-reputation-service-2019.06.04 0 p STARTED 3138584 3.7gb 209.87.212.124 us-elkdb20
te-apache-logs-2019.06.04 1 r STARTED 11907461 2.8gb 209.87.212.125 us-elkdb21
te-apache-logs-2019.06.04 4 r STARTED 11864012 2.8gb 209.87.212.125 us-elkdb21
te-apache-logs-2019.06.04 2 r STARTED 11837206 2.9gb 209.87.212.125 us-elkdb21
te-apache-logs-2019.06.04 3 r STARTED 11860627 2.9gb 209.87.212.125 us-elkdb21
te-apache-logs-2019.06.04 0 r STARTED 11903103 2.9gb 209.87.212.125 us-elkdb21
aws-reputation-service-2019.06.04 5 r STARTED 3167078 3.7gb 209.87.212.125 us-elkdb21
aws-reputation-service-2019.06.04 1 r STARTED 3137881 3.5gb 209.87.212.125 us-elkdb21
aws-reputation-service-2019.06.04 6 r STARTED 3163876 3.8gb 209.87.212.125 us-elkdb21
aws-reputation-service-2019.06.04 4 r STARTED 3159819 3.9gb 209.87.212.125 us-elkdb21
aws-reputation-service-2019.06.04 2 r STARTED 3146566 4gb 209.87.212.125 us-elkdb21
aws-reputation-service-2019.06.04 3 r STARTED 3154639 4gb 209.87.212.125 us-elkdb21
aws-reputation-service-2019.06.04 0 r STARTED 3140056 3.5gb 209.87.212.125 us-elkdb21

From your previous comment, I understand that if I raise cluster.routing.allocation.balance.index from the default of 0.55f to, for example, 0.75f, it can help fix this behavior, right?

Thanks in advance,
Lior

I currently have this:

"balance": {
            "index": "0.99f",
            "shard": "0.01f"
          },

But you shouldn't assume I'm giving you specific advice down to that level. I was only pointing you to the setting. I'm not in your context and can't make specific recommendations. I also don't work for Elastic.
I also haven't mastered that setting myself, because I haven't had time to experiment with it fully, so I can't claim full expertise of it.
But that is my current setting, and I end up with each index completely flattened across the nodes AND the total shard count also equal across the nodes.

In your case, because flattening your indices across your nodes can still lead to hotter nodes, it can't guarantee that all nodes get an equal indexing workload: randomness can still put an unequal number of hot shards on some nodes. That part of the workaround is me using 10 shards and 1 replica for indexing-heavy indices. But it would be even more foolish of me to give you specific advice about that, as there are oversharding risks and I can only evaluate that risk in my own situation, not yours.

Also, you have to allow rebalancing, etc., but that is allowed by default. So with normal settings, if you haven't disabled rebalancing and your cluster is in a state where it's allowed to make balancing moves, it should start moving shards around if the setting is having the desired impact.

The moves will flatten the shards of each index across the nodes, as that is what this setting influences. If you think it's too slow and you have spare bandwidth and resources, you can accelerate the balancing moves by raising the bandwidth they may use and the number of concurrent moves allowed. I can't give specific advice about that either; if you're not familiar with it, the best thing is to keep the defaults. Those balancing settings are described very clearly in the docs.
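For reference, the knobs involved are along these lines (the values below are only illustrative; double-check the setting names and defaults in the docs for your version before touching them):

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": 4,
    "indices.recovery.max_bytes_per_sec": "80mb"
  }
}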

Hey @martinr_ubi, so I started with a small change and raised balance.index from 0.55f to 0.65f, and the cluster state looks much better now:

[root@us-elkdb15 tmp]# source test.sh
today shards on node us-elkdb1:
5
today shards on node us-elkdb2:
7
today shards on node us-elkdb3:
5
today shards on node us-elkdb4:
6
today shards on node us-elkdb5:
6
today shards on node us-elkdb6:
5
today shards on node us-elkdb7:
6
today shards on node us-elkdb8:
5
today shards on node us-elkdb9:
4
today shards on node us-elkdb10:
8
today shards on node us-elkdb11:
6
today shards on node us-elkdb12:
5
today shards on node us-elkdb13:
7
today shards on node us-elkdb14:
5
today shards on node us-elkdb15:
4
today shards on node us-elkdb16:
6
today shards on node us-elkdb17:
7
today shards on node us-elkdb18:
4
today shards on node us-elkdb19:
5
today shards on node us-elkdb20:
7
today shards on node us-elkdb21:
5

I do have another question though: from the documentation and your example I see that the sum of balance.index and balance.shard is 1.00. If I set balance.index to 0.65 but leave balance.shard at 0.45, can that cause issues? In other words, if I raise one, do I need to decrease the other?

Many thanks,
Lior

For sure it's not going to unassign shards or do anything truly bad like that; it only influences the moves of an algorithm that tries to move things around when and if it can.

I also think it influences the placement decision at shard creation time too. Unless I'm confused, it also helps with situations like "a node breaks at 23:55 UTC, a new machine joins with an empty disk, and at 00:00 UTC all of the new day's HOT shards want to go ON that empty node..."

If I'm wrong about that, it's because the cluster would do that bad thing first... and then start moving them out again immediately afterwards. I don't remember the results when I tested that on recent versions.

I think it's safe to raise one without lowering the other, depending on what you want... because if it weren't, they would have said so? :zipper_mouth_face:
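If memory serves (this is from reading the BalancedShardsAllocator source a while ago, so treat it as an assumption and verify against the version you run), the allocator normalizes the two factors by their sum, roughly:

weight(node, index) = (balance.shard / (balance.shard + balance.index)) * (shardsOnNode - avgShardsPerNode)
                    + (balance.index / (balance.shard + balance.index)) * (shardsOfIndexOnNode - avgShardsOfIndexPerNode)

So only the ratio between the two values should matter, not whether they add up to 1.0.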

I set mine like that because the defaults add up to 1.0 and I don't know any better.

Also, what I wanted was to lower the tendency to equalize the total shard count across all nodes, because that means the cluster can sometimes put too many HOT shards on a single node...

I also wanted to raise the tendency to equalize each index's shards across the nodes; I want as flat a spread as currently possible. If I have 20 hot shards from 1 index and 20 data nodes, I want 1 hot shard per node. So I want the number of shards from the same index to be "equal" across the nodes. (Repeat for every significant index.)

Keep in mind that, unless I've injured my head... no piece of code in there makes any distinction between a hot shard and one that doesn't receive indexing requests. This concept only exists through daily rotated indices; ES itself just sees shards. External factors mean that some of them haven't received indexing requests since 00:00 yesterday, but ES doesn't know that. A shard is a shard...

So the indexing workload follows the HOT shards like heat-seeking missiles, and the cluster sees nothing wrong with putting all the hot shards on a few of your nodes if the current state + settings + algorithms produce that placement result.

Dividing the indexing workload equally among ES data nodes is (as far as I know) a hard problem to solve, and the only knobs I know of are listed at the bottom of this post.

I want my data node metrics to look like that (grouped):

Also remember the alternatives. I elected to do that for now... You should consider everything:
https://www.elastic.co/guide/en/elasticsearch/reference/current/allocation-total-shards.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/shard-allocation-filtering.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/shards-allocation.html#_shard_balancing_heuristics

Issues like those, and others I'm sure (I'm probably overdue for rereading all of them to see where brains like David are going with this :slight_smile: )


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.