Shard allocation fails during rebalancing

We recently ran into a problem with our Elasticsearch cluster. We took a data node out of the cluster and completely reinstalled it with Debian 11.

When the reinstall was done and the node rejoined the cluster, Elasticsearch started rebalancing as expected. However, the rebalancing only runs for a while and then aborts. Here are the logs and the configuration of the master and of the data node in question:

In dmesg you could also see that the Elasticsearch process had been blocked for over 4 minutes.
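
For reference, the relevant kernel hung-task warnings can be pulled out of dmesg with something like this (just a sketch):

dmesg -T | grep -B1 -A15 'blocked for more than'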

The cluster state then went YELLOW because all the shards that had already been moved to the new data node by the rebalancing were suddenly unassigned. Elasticsearch then correctly initialized them elsewhere.
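
For reference, something like the following shows which shards are unassigned and why (the localhost endpoint is a placeholder for our actual hosts):

# list shards that are not STARTED, including the recorded unassigned reason
curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason,node' | grep -v STARTED

# ask the cluster to explain an arbitrary unassigned shard in detail
curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty'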

We had already successfully migrated other data nodes to Debian 11 at that time - so it doesn't seem to be a fundamental problem.

So the question is why it doesn't work on this particular server. The problem is reproducible: we have made many attempts and also played around with some settings (decreased cluster.routing.allocation.cluster_concurrent_rebalance from 8 to 2, increased cluster.follower_lag.timeout from 90s to 180s and cluster.publish.timeout from 30s to 60s).
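
For completeness, this is roughly how we applied those changes (host, port and file paths are placeholders; as far as I know the rebalance setting is dynamic, while the two timeouts are static and require a restart):

# dynamic setting, changed via the cluster settings API
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"transient":{"cluster.routing.allocation.cluster_concurrent_rebalance": 2}}'

# static timeouts, appended to elasticsearch.yml followed by a node restart
cat >> /etc/elasticsearch/elasticsearch.yml <<'EOF'
cluster.follower_lag.timeout: 180s
cluster.publish.timeout: 60s
EOF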

Our shard sizes vary a lot: some are only a few MB in size, others are > 100 GB.

Hopefully someone can help us debug the root cause or point us in the right direction.

Thank you in advance!

What exactly did you see in the dmesg output?

The ClusterApplierService should have continued logging TRACE messages at some point. Can you share the rest of the log?

I strongly recommend leaving all the settings you mention at their defaults.

Finally, did you capture a stack dump of the node while it was stuck? It would be very helpful to see where it was spending all its time. If not, but you can reproduce this, run jstack every few seconds for a minute or so to capture some stack dumps.
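
Something along these lines would do as a rough sketch (the pid lookup and output paths are just placeholders; run it as the same user as the Elasticsearch process):

# capture a stack dump every 5 seconds for about a minute
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)
for i in $(seq 1 12); do
  jstack "$ES_PID" > "/tmp/jstack_dump_$i.txt"
  sleep 5
done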

Hi David, thanks for your reply.

You asked for several things:

The dmesg output is in my new gist. I have also included the complete rest of the log file, but there are no further TRACE messages from the ClusterApplierService.

I also took some stack dumps as you requested. The first stack dump was made right at the beginning of the rebalancing.

The second stack dump followed after 7 minutes; at that point everything still looked fine.

The first sign of the problem is always the master starting to log messages like these:

[2021-10-14T12:40:59,699][INFO ][o.e.c.c.C.CoordinatorPublication] [es-master02] after [9.9s] publication of cluster state version [5430593] is still waiting for {es-node05-a}{AcV5mJL5T-SPk4LiV7x2qA}{VLvGSpD_Ru2tAdAO3P7rAg}{192.168.200.184}{192.168.200.184:19301}{cdhilrstw}{ml.machine_memory=67185217536, ml.max_open_jobs=20, xpack.installed=true, disks=ssd, machine=192.168.6.184, transform.node=true} [SENT_APPLY_COMMIT]
[2021-10-14T12:41:19,741][WARN ][o.e.c.c.C.CoordinatorPublication] [es-master02] after [30s] publication of cluster state version [5430593] is still waiting for {es-node05-a}{AcV5mJL5T-SPk4LiV7x2qA}{VLvGSpD_Ru2tAdAO3P7rAg}{192.168.200.184}{192.168.200.184:19301}{cdhilrstw}{ml.machine_memory=67185217536, ml.max_open_jobs=20, xpack.installed=true, disks=ssd, machine=192.168.6.184, transform.node=true} [SENT_APPLY_COMMIT]

At that moment I started taking stack dumps at short intervals. I have included three of them in my gist (jstack_dump_3, jstack_dump_4, jstack_dump_5).

I forgot to mention that the data node does not recover on its own after this incident. It has to be rebooted every time to start a new attempt.

Thanks, that's great. It's stuck creating a directory:

"elasticsearch[es-node05-a][clusterApplierService#updateTask][T#1]" #44 daemon prio=5 os_prio=0 cpu=2034.28ms elapsed=820.11s tid=0x00007fcca9015800 nid=0x16d8 runnable  [0x00007fcc9f0f2000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.fs.UnixNativeDispatcher.mkdir0(java.base@11.0.12/Native Method)
	at sun.nio.fs.UnixNativeDispatcher.mkdir(java.base@11.0.12/UnixNativeDispatcher.java:229)
	at sun.nio.fs.UnixFileSystemProvider.createDirectory(java.base@11.0.12/UnixFileSystemProvider.java:385)
	at java.nio.file.Files.createDirectory(java.base@11.0.12/Files.java:690)
	at java.nio.file.Files.createAndCheckIsDirectory(java.base@11.0.12/Files.java:797)
	at java.nio.file.Files.createDirectories(java.base@11.0.12/Files.java:783)
	at org.elasticsearch.index.store.FsDirectoryFactory.newDirectory(FsDirectoryFactory.java:66)
...

There are no locks or other JDK magic involved in this process; we're in a very simple function that really just calls the libc mkdir function (which in turn executes the mkdir syscall):

I'm not sure how to proceed from here. We seem to be calling into the OS correctly; it's just not creating the directory we ask for, but nor does it return an error. The dmesg output agrees: we're hung for minutes waiting for something I/O-related. Maybe you have flaky hardware? Maybe it's a kernel or filesystem bug?
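
If you can catch it in that state again, the kernel-side stack of the stuck thread might also show what it is waiting on. As a rough sketch (needs root; the pid/tid placeholders come from the jstack output, where nid= is the thread id in hex):

# convert the hex nid from jstack (e.g. 0x16d8) to decimal to get the tid
cat /proc/<pid>/task/<tid>/stack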

It ought to be possible to reproduce this outside of Elasticsearch. Try stress-ng with some I/O-related options, including --dir, but also some options that create/delete files, like --filename, --hdd, --mmap/--mmap-file, --readahead, --rename, --seek, that sort of thing.

Thank you very much so far.

In fact, we also noticed that Elasticsearch sometimes does not manage to complete the health checks on its data paths within the given time.

We had already suspected the disks, checked all of them with fsck, and also completely reformatted them.

We will try to reproduce the problem outside of Elasticsearch with stress-ng. Perhaps we can also leave out individual disks and see if it works better then (or replace the disks completely).

One more quick question: would you definitely recommend leaving cluster_concurrent_rebalance at the default value of 2?

Yes. If you let the rebalancer look too far ahead then it can make some weird/suboptimal decisions. If you want faster shard movements, increase indices.recovery.max_bytes_per_sec: it defaults to 40MB/s, which is pretty conservative.
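
For example, something like this (the endpoint is a placeholder and 100mb is just an illustrative value; the setting is dynamic, so no restart is needed):

curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"transient":{"indices.recovery.max_bytes_per_sec": "100mb"}}'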

Update: we tested each of the 6 SSDs for 15 minutes with:

stress-ng --cpu 8 --dir 8 --filename 8 --hdd 8 --mmap 8 --readahead 8 --rename 8 --seek 8 --timeout 900

At least 2 disks had problems. See dmesg:

The stress-ng process was blocked from that moment on; CPU load and I/O dropped immediately. So it's basically the same phenomenon.
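
For reference, the blocked workers can be spotted with a rough one-liner like this, which lists processes stuck in uninterruptible D-state sleep together with the kernel function they are waiting in:

ps -eo pid,stat,wchan,comm | awk '$2 ~ /^D/'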

We don't know whether more disks are affected (it could be that the stress test didn't run long enough). But it would be very unlikely for several disks that behaved completely normally before to develop the same problem at the same time. As I said, we only upgraded to Debian 11, and there were no problems before that. So we will reinstall the machine completely tomorrow, but it seems the problem is not caused by Elasticsearch.


Great, thanks for reporting back, and I'm glad you found an easier way to reproduce the problem (although of course sorry to hear that there is a problem like this at all).
