Shard allocation fails during rebalancing

We recently ran into a problem with our Elasticsearch cluster. We took a data node out of the cluster and completely reinstalled it with Debian 11.

When the reinstall was done and the node rejoined the cluster, Elasticsearch started rebalancing as expected. However, the rebalancing only runs for a while and then aborts. Here are the logs and the configuration of the master and of the data node in question:

In dmesg you could also see that the Elasticsearch process had been blocked for over 4 minutes.
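
For reference, the relevant kernel hung-task warnings can be pulled out of dmesg with something like this (just a sketch):

dmesg -T | grep -B1 -A15 'blocked for more than'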

The cluster state then went YELLOW because all the shards that had already been moved to the new data node by the rebalancing were suddenly unassigned. Elasticsearch then correctly initialized them elsewhere.
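
For reference, something like the following shows which shards are unassigned and why (the localhost endpoint is a placeholder for our actual hosts):

# list shards that are not STARTED, including the recorded unassigned reason
curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason,node' | grep -v STARTED

# ask the cluster to explain an arbitrary unassigned shard in detail
curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty'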

We had already successfully migrated other data nodes to Debian 11 at that time - so it doesn't seem to be a fundamental problem.

So the question is why it doesn't work on this particular server. The problem is reproducible: we have made many attempts and also played around with some settings (decreased cluster.routing.allocation.cluster_concurrent_rebalance from 8 to 2, increased cluster.follower_lag.timeout from 90s to 180s and cluster.publish.timeout from 30s to 60s).
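
For completeness, this is roughly how we applied those changes (host, port and file paths are placeholders; as far as I know the rebalance setting is dynamic, while the two timeouts are static and require a restart):

# dynamic setting, changed via the cluster settings API
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"transient":{"cluster.routing.allocation.cluster_concurrent_rebalance": 2}}'

# static timeouts, appended to elasticsearch.yml followed by a node restart
cat >> /etc/elasticsearch/elasticsearch.yml <<'EOF'
cluster.follower_lag.timeout: 180s
cluster.publish.timeout: 60s
EOF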

Our shard sizes vary a lot: some are only a few MB in size, others are > 100 GB.

Hopefully someone can help us debug the root cause or point us in the right direction.

Thank you in advance!

What exactly did you see in the dmesg output?

The ClusterApplierService should have continued logging TRACE messages at some point. Can you share the rest of the log?

I strongly recommend leaving all the settings you mention at their defaults.

Finally, did you capture a stack dump of the node while it was stuck? It would be very helpful to see where it was spending all its time. If not, but you can reproduce this, run jstack every few seconds for a minute or so to capture some stack dumps.
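
Something along these lines would do as a rough sketch (the pid lookup and output paths are just placeholders; run it as the same user as the Elasticsearch process):

# capture a stack dump every 5 seconds for about a minute
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)
for i in $(seq 1 12); do
  jstack "$ES_PID" > "/tmp/jstack_dump_$i.txt"
  sleep 5
done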

Hi David, thanks for your reply.

You asked for several things:

The dmesg output is in my new gist. I have also included the complete rest of the log file, but there are no further TRACE messages from the ClusterApplierService.

I also took some stack dumps as you requested. The first stack dump was made right at the beginning of the rebalancing.

The second stack dump followed after 7 minutes; at that point everything still looked fine.

The first sign of the problem is always the master starting to log messages like these:

[2021-10-14T12:40:59,699][INFO ][o.e.c.c.C.CoordinatorPublication] [es-master02] after [9.9s] publication of cluster state version [5430593] is still waiting for {es-node05-a}{AcV5mJL5T-SPk4LiV7x2qA}{VLvGSpD_Ru2tAdAO3P7rAg}{192.168.200.184}{192.168.200.184:19301}{cdhilrstw}{ml.machine_memory=67185217536, ml.max_open_jobs=20, xpack.installed=true, disks=ssd, machine=192.168.6.184, transform.node=true} [SENT_APPLY_COMMIT]
[2021-10-14T12:41:19,741][WARN ][o.e.c.c.C.CoordinatorPublication] [es-master02] after [30s] publication of cluster state version [5430593] is still waiting for {es-node05-a}{AcV5mJL5T-SPk4LiV7x2qA}{VLvGSpD_Ru2tAdAO3P7rAg}{192.168.200.184}{192.168.200.184:19301}{cdhilrstw}{ml.machine_memory=67185217536, ml.max_open_jobs=20, xpack.installed=true, disks=ssd, machine=192.168.6.184, transform.node=true} [SENT_APPLY_COMMIT]

At that moment I started taking stack dumps at short intervals. I have included three of them in my gist (jstack_dump_3, jstack_dump_4, jstack_dump_5).

I forgot to mention that the data node does not recover on its own after this incident. It has to be rebooted every time to start a new attempt.

Thanks, that's great. It's stuck creating a directory:

"elasticsearch[es-node05-a][clusterApplierService#updateTask][T#1]" #44 daemon prio=5 os_prio=0 cpu=2034.28ms elapsed=820.11s tid=0x00007fcca9015800 nid=0x16d8 runnable  [0x00007fcc9f0f2000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.fs.UnixNativeDispatcher.mkdir0(java.base@11.0.12/Native Method)
	at sun.nio.fs.UnixNativeDispatcher.mkdir(java.base@11.0.12/UnixNativeDispatcher.java:229)
	at sun.nio.fs.UnixFileSystemProvider.createDirectory(java.base@11.0.12/UnixFileSystemProvider.java:385)
	at java.nio.file.Files.createDirectory(java.base@11.0.12/Files.java:690)
	at java.nio.file.Files.createAndCheckIsDirectory(java.base@11.0.12/Files.java:797)
	at java.nio.file.Files.createDirectories(java.base@11.0.12/Files.java:783)
	at org.elasticsearch.index.store.FsDirectoryFactory.newDirectory(FsDirectoryFactory.java:66)
...

There are no locks or other JDK magic involved in this process; we're in a very simple function that really just calls the libc mkdir function (which in turn executes the mkdir syscall):

I'm not sure how to proceed from here. We seem to be calling into the OS correctly; it's just not creating the directory we ask for, but nor does it return an error. The dmesg output agrees: we're hung for minutes waiting for something I/O-related. Maybe you have flaky hardware? Maybe it's a kernel or filesystem bug?
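
If you can catch it in that state again, the kernel-side stack of the stuck thread might also show what it is waiting on. As a rough sketch (needs root; the pid/tid placeholders come from the jstack output, where nid= is the thread id in hex):

# convert the hex nid from jstack (e.g. 0x16d8) to decimal to get the tid
cat /proc/<pid>/task/<tid>/stack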

It ought to be possible to reproduce this outside of Elasticsearch. Try stress-ng with some I/O-related options, including --dir, but also some options that create/delete files, like --filename, --hdd, --mmap/--mmap-file, --readahead, --rename, --seek, that sort of thing.

Thank you very much so far.

In fact, we also noticed that Elasticsearch sometimes does not manage to complete the health checks on its data paths within the given time.

We had already suspected the disks, checked all of them with fsck, and also completely reformatted them.

We will try to reproduce the problem outside of Elasticsearch with stress-ng. Perhaps we can also leave out individual disks and see if it works better then (or replace the disks completely).

One more quick question: would you definitely recommend leaving cluster_concurrent_rebalance at the default value of 2?

Yes. If you let the rebalancer look too far ahead then it can make some weird/suboptimal decisions. If you want faster shard movements, increase indices.recovery.max_bytes_per_sec: it defaults to 40MB/s, which is pretty conservative.
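
For example, something like this (the endpoint is a placeholder and 100mb is just an illustrative value; the setting is dynamic, so no restart is needed):

curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"transient":{"indices.recovery.max_bytes_per_sec": "100mb"}}'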

Update: we tested each of the 6 SSDs for 15 minutes with:

stress-ng --cpu 8 --dir 8 --filename 8 --hdd 8 --mmap 8 --readahead 8 --rename 8 --seek 8 --timeout 900

At least 2 disks had problems. See dmesg:

The stress-ng process was blocked from that moment on; CPU load and I/O dropped immediately. So it's basically the same phenomenon.
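
For reference, the blocked workers can be spotted with a rough one-liner like this, which lists processes stuck in uninterruptible D-state sleep together with the kernel function they are waiting in:

ps -eo pid,stat,wchan,comm | awk '$2 ~ /^D/'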

We don't know whether more disks are affected (it could be that the stress test didn't run long enough). But it would be very unlikely for several disks that behaved completely normally before to develop the same problem at the same time. As I said, we only upgraded to Debian 11, and there were no problems before that. So we will reinstall the machine completely tomorrow, but it seems the problem is not caused by Elasticsearch.


Great, thanks for reporting back, and I'm glad you found an easier way to reproduce the problem (although of course sorry to hear that there is a problem like this at all).
