Shard allocation fails during rebalancing

bunste · October 14, 2021, 3:21pm

Update: We tested each of the 6 SSDs for 15 minutes

stress-ng --cpu 8 --dir 8 --filename 8 --hdd 8 --mmap 8 --readahead 8 --rename 8 --seek 8 --timeout 900

At least 2 disks had problems. See dmseg:

gist.github.com

https://gist.github.com/shahedsalehian/0caa288dfa3fb8efe6475d79ca929dff

gistfile1.txt

[ 5800.688705] INFO: task jbd2/sdf1-8:667 blocked for more than 120 seconds.
[ 5800.688749]       Tainted: G          I       5.10.0-9-amd64 #1 Debian 5.10.70-1
[ 5800.688783] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5800.688818] task:jbd2/sdf1-8     state:D stack:    0 pid:  667 ppid:     2 flags:0x00004000
[ 5800.688822] Call Trace:
[ 5800.688832]  __schedule+0x282/0x870
[ 5800.688835]  ? out_of_line_wait_on_bit_lock+0xb0/0xb0
[ 5800.688837]  schedule+0x46/0xb0
[ 5800.688839]  io_schedule+0x42/0x70
[ 5800.688841]  bit_wait_io+0xd/0x50

This file has been truncated. show original

gistfile2.txt

[Thu Oct 14 16:21:56 2021] INFO: task kworker/u50:2:139 blocked for more than 120 seconds.
[Thu Oct 14 16:21:56 2021]       Tainted: G          I       5.10.0-9-amd64 #1 Debian 5.10.70-1
[Thu Oct 14 16:21:56 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Oct 14 16:21:56 2021] task:kworker/u50:2   state:D stack:    0 pid:  139 ppid:     2 flags:0x00004000
[Thu Oct 14 16:21:56 2021] Workqueue: writeback wb_workfn (flush-8:64)
[Thu Oct 14 16:21:56 2021] Call Trace:
[Thu Oct 14 16:21:56 2021]  __schedule+0x282/0x870
[Thu Oct 14 16:21:56 2021]  schedule+0x46/0xb0
[Thu Oct 14 16:21:56 2021]  io_schedule+0x42/0x70
[Thu Oct 14 16:21:56 2021]  wait_on_page_bit_common+0x116/0x3b0

This file has been truncated. show original

The stress-ng process was blocked from this moment on - CPU load and I/O went down immediately. So basically the same phenomenon.

We don't know if more disks are affected (it could be that the stress test didn't run long enough). But it would be very unlikely if several disks show the same problem at the same time, which behaved totally normal before. As I said, we only upgraded to Debian 11 and there were no problems before. So we will reinstall completely tomorrow but it seems that the problem is not caused by Elasticsearch.

Topic		Replies	Views
Shard Allocation Problem Elasticsearch	3	348	July 6, 2017
Shard Allocation Problem Elasticsearch	3	739	July 6, 2017
Elasticsearch rolling restart problem Elasticsearch	5	446	July 6, 2017
How to handle system failures in Elasticsearch cluster Elasticsearch	5	419	July 6, 2017
Unassigned shards -- and two shards rebalancing Elasticsearch	7	1051	December 12, 2023

Shard allocation fails during rebalancing

Related topics