Index balance in the cluster

Hello All,
I have noticed that one server out of the five in the cluster stores the majority of the data/indices.
Also, one of the nodes holds a lower number of shards.

How can I balance the indices across the nodes?
How can I balance the shards across the nodes?

This is the status of the cluster:

01 571 455
02 571 457
03 288 400
04 571 449
05 571 446


What does disk usage look like on your nodes?

Disk usage for the data directory is:

Node Size of data/index
1 271G
2 257G
3 140G
4 277G
5 262G

So it looks like node 3 is used less by Elasticsearch.
I expected the data to be evenly split between the nodes, so that each node would hold about 240G.

What is the history? I.e., were there more indexes before, or did it perhaps become unbalanced while that node was down and some indexes were created, etc.? Are all nodes the same in terms of disk/RAM/JVM, etc.? And what are the typical index settings for shards/replicas?

And is there ANY routing going on, e.g. for HA in a cloud (allocation awareness)? Any playing with those settings, now or in the past?

Do you often create new indexes (like daily) and close or purge them? Any closed indexes (which won't show in many lists, but still use space)?
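One way to spot closed indexes is the cat indices API, which lists them with status `close` (a sketch, runnable in Kibana Dev Tools or via curl):

```
GET _cat/indices?v&h=index,status&s=status
```

Anything reported with status `close` still occupies disk space even though it won't appear in most UIs.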

Shards are balanced (not indexes) via disk space and other factors. There are also some settings for this (like cluster.routing.allocation.balance.shard, which I've not played with, but which would seem to influence re-balancing):
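To see what those balance settings are currently set to (including the defaults), something like this should work:

```
GET _cluster/settings?include_defaults=true&filter_path=defaults.cluster.routing.allocation.balance*
```

That returns the `balance.shard`, `balance.index`, and `balance.threshold` weights the allocator is using.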

Hello @Steve_Mushero
Thanks for the reply.
Here are the answers for your questions:

  • No serious history. There were fewer indices in the past; the cluster has grown a bit over the past couple of weeks.
  • All VMs are identical.
  • Typical index settings are 1 primary and 1 replica shard. Some have 2 primaries and 2 replicas.
  • Indices are created hourly, daily, and weekly.
  • Some indices are purged hourly, some weekly.
  • Most settings are Elastic's defaults. No changes to shard settings.

About closed indices that won't show but use space:
How can I find out whether there are any such cases?

Thanks! :airplane:

I am assuming cluster rebalancing is not blocked and that node 3's disk does not contain anything other than ES data. The latter is important, as available disk space matters, not the size of the disk. You can verify this using the node stats API: _nodes/<node_id>/stats/fs?pretty or _nodes/stats/fs?pretty.
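For example, to pull just the per-node disk totals from that API (a sketch using `filter_path` to trim the response):

```
GET _nodes/stats/fs?filter_path=nodes.*.name,nodes.*.fs.total&pretty
```

The `fs.total` object reports `total_in_bytes`, `free_in_bytes`, and `available_in_bytes`, which is what the allocator actually looks at.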

Unless you are experiencing performance issues, I wouldn't worry about it. In my experience, ES doesn't move shards unless you hit the disk high watermark or a certain node is overloaded, because moving a shard is an expensive operation: it consumes network bandwidth, incurs GC, and loses caches. When a new shard is allocated, available disk space is taken into account, but ES cannot predict how big that shard is going to grow. It treats all shards equally. Some shards may grow faster, but they won't be moved unless you hit the watermark.

If you are experiencing performance issues, you first need to decide which shard to move. This depends on your query volume on each index and on the shard sizes. You can then use the cluster reroute API to move specific shard(s).
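A sketch of a reroute move command (the index and node names here are placeholders, substitute your own):

```
POST _cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "logs-2020.01.01",
        "shard": 0,
        "from_node": "node1",
        "to_node": "node3"
      }
    }
  ]
}
```

Note this moves a single shard copy; the allocator may still move other shards afterwards to restore its own idea of balance.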

For a long-term solution, I would explore something along the lines of ILM, since you are using time-based indices. Your cluster size is small, so you need to assess based on the query volume on the old data.

Finally, if for some reason you want all nodes to have roughly equal utilization, compute (total data size across all nodes * 100) / (5 * disk size). Then set the low and high watermarks slightly higher than that; this will force ES to rebalance. Once rebalancing completes, you can set those back to the defaults. Make sure you let the hourly indices be created for several hours, and maybe even the daily ones for a day or two, before changing the watermarks. I wouldn't recommend this, as I don't see major benefits and it can go badly wrong if not done correctly. I'm including this option only for information.
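For illustration, the watermark change would look something like this (the percentages are made up; compute yours from the formula above, and set the values back to `null` afterwards to restore the defaults):

```
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "46%",
    "cluster.routing.allocation.disk.watermark.high": "48%"
  }
}
```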


Hi @Vinayak_Sapre
thanks for your reply.
I do have performance issues and I am trying to find the root cause.
It is not necessarily because of the node balance.

I have more than 2TB available on the disk of each node, so I think this is not an issue.



I would analyze the slow query log and look at the shard size / distribution of those indices. Also check that the queries are written correctly.
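If the search slowlog isn't enabled yet, it can be turned on per index with thresholds like these (the index name and thresholds are just examples):

```
PUT /my-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s"
}
```

Queries exceeding a threshold are then logged per shard, which also tells you which node served them.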

Are you seeing significantly different CPU / IO utilization on this node?


Based on the different time periods covered by the indices, it sounds like you could have indices of very different sizes. How large is your largest index? How many indices around that size do you have? How large is your smallest index?
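One way to check this is the cat indices API sorted by store size (a sketch):

```
GET _cat/indices?v&h=index,docs.count,store.size&s=store.size:desc
```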

Size by doc count? Store size?
I might have indices of different sizes. How does that affect the cluster?

I am analysing slowlogs.
Still trying to understand if that node is acting differently and how.

Slow logs won't affect where shards go. We have a new visual view we are working on, for the cluster and by index, to see where things are going; a couple of other tools have a bit of that. But there must be some reason this built up over time, especially if you are creating new indexes all the time.

By size, I'm pretty sure Christian means store size; doc counts don't matter, as docs can be 10 bytes or 100MB each. Size might affect where things go, especially if huge, I guess.

I am wondering whether you may have a few very large shards that skew the balance.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.