Unbalanced disk usage with ES 2.4.x

Hi,

On a 7-node cluster, I had to replace two of the nodes. For that I used the cluster routing allocation exclusion lists to move shards away from those two nodes so I could shut down ES on both.

Then I added new nodes to replace them, with the same settings (same name, same IP, etc.), and set the cluster routing allocation exclusion parameters to empty so shards could move to the new nodes. ES did move shards to them, but one of the new nodes now has too much data on it while the other nodes have a lot of free space. For example, I have nodes with less than 30% disk usage while one node is at 65%.
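For reference, this is roughly what I ran (the IPs are placeholders, and I excluded by _ip; _name or _host should work the same way):

    # Exclude the two nodes being decommissioned so shards move off them:
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": {
        "cluster.routing.allocation.exclude._ip": "10.0.0.1,10.0.0.2"
      }
    }'

    # After the replacement nodes joined, clear the exclusion again:
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": {
        "cluster.routing.allocation.exclude._ip": ""
      }
    }'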

I have lots of deleted docs across many shards/segments/indices, which I intend to fix by forcing a merge, but I am afraid to trigger that while one of the nodes is so unbalanced in disk usage.
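For the merge, this is the kind of call I have in mind per index (the index name is just an example), but I'd like to understand the balancing first:

    # Merge away segments dominated by deleted docs; heavy on disk and IO.
    curl -XPOST 'http://localhost:9200/my-index/_forcemerge?only_expunge_deletes=true'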

So, the questions are:

  • I can't remove the transient.cluster.routing.allocation.exclude.* settings without downtime (restarting the whole cluster). Could their presence keep ES from allocating or rebalancing shards properly? (I've sketched after this list how I'm checking the current settings.)
  • How does ES distribute data in the cluster? Is it based on number of segments, shards, disk usage, etc.?
  • How do deleted documents affect data distribution?
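In case it helps answer these, here is how I'm checking the current state (the host is a placeholder):

    # Per-node shard counts and disk usage, to see how unbalanced things are:
    curl -XGET 'http://localhost:9200/_cat/allocation?v'

    # Cluster-level routing settings, including any leftover exclude.* values:
    curl -XGET 'http://localhost:9200/_cluster/settings?pretty'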

best wishes

Herbert

OK, I just found out that the problem is dirty data sitting on the node in question that doesn't belong to it, which might explain the weird disk usage.

However, I find the documentation lacking on how to handle this kind of issue.

ES should clean that up when shards are moved away or deleted. Can you locate a directory on disk that shouldn't be there?
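Something along these lines should show it, assuming the default 2.x on-disk layout (the paths and names below are placeholders):

    # Shard data physically present on the suspect node:
    ls /var/lib/elasticsearch/<cluster_name>/nodes/0/indices/

    # Shards the cluster actually has allocated to that node:
    curl -XGET 'http://localhost:9200/_cat/shards?v' | grep <node_name>

Anything sitting in the indices directory that doesn't show up for that node in _cat/shards would be a candidate.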

Thanks for the info.

After adding the 2 new nodes to the cluster, the rebalance also triggered a cleanup on the node that had the "dirty" data.

Is there an API that shows anything about this "dirty" data? What should I look for?

There's not, because there is no concept of that in Elasticsearch. Once you delete data, it's gone.

If it appears to be hanging around then we'd need to look at the filesystem to see what is/isn't happening.

Ok. If it happens again I will update this.

Thanks!


I managed to get new logs after trying to add new nodes to the cluster. I'm seeing many logged errors like the one below after ES moved a few shards.

[2017-11-29 16:19:36,609][DEBUG][cluster.service          ] [els10] processing [indices_store ([[v3-large-customer-data-60][14]] active fully on other nodes)]: execute
[2017-11-29 16:19:36,609][DEBUG][indices.store            ] [els10] [v3-large-customer-data-60][14] failed to delete unallocated shard, ignoring
org.apache.lucene.store.LockObtainFailedException: Can't lock shard [v3-large-customer-data-60][14], timed out after 0ms
	at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:609)
	at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:537)
	at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:506)
	at org.elasticsearch.env.NodeEnvironment.deleteShardDirectorySafe(NodeEnvironment.java:344)
	at org.elasticsearch.indices.IndicesService.deleteShardStore(IndicesService.java:578)
	at org.elasticsearch.indices.store.IndicesStore$ShardActiveResponseHandler$1.execute(IndicesStore.java:303)
	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45)
	at org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:480)
	at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:784)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
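To see where the copies of that shard actually live while this is happening, I'm checking allocation like this (the grep is just a quick filter on the shard number):

    # List all copies of shard 14 of that index and the nodes they are on:
    curl -XGET 'http://localhost:9200/_cat/shards/v3-large-customer-data-60?v' | grep ' 14 '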
