Read-only file system

Hello everyone,

From Kibana I started seeing load errors, and when checking the cluster I saw that one of the servers was down. I logged in over SSH: the server was up and the elasticsearch service was running, but when I restarted the service it did not come back up.

When I check the log I find that something is in read-only mode. The disk is not full, and honestly I have no idea how to proceed in this case.

[2025-02-06T08:46:44,276][WARN ][o.e.c.a.s.ShardStateAction] [elastic-1] node closed while execution action [internal:cluster/shard/failure] for shard entry [FailedShardEntry{shardId [[.ds-ilm-history-7-2024.12.16-000060][0]], allocationId [oGr2qhxGQpK308BVu54asw], primary term [0], message [shard failure, reason [lucene commit failed]], markAsStale [true], failure [java.nio.file.FileSystemException: /home/data/elasticsearch/indices/9mK-GH_sQOyyipyf_naC-w/0/index/pending_segments_3t: Read-only file system

Filesystem                         Size  Used Avail Use% Mounted on
tmpfs                              3.2G  1.2M  3.2G   1% /run
/dev/mapper/ubuntu--vg-ubuntu--lv   28G   13G   14G  48% /
tmpfs                               16G     0   16G   0% /dev/shm
tmpfs                              5.0M     0  5.0M   0% /run/lock
/dev/sda2                          2.0G  243M  1.6G  14% /boot
/dev/sda1                          1.1G  6.1M  1.1G   1% /boot/efi
/dev/sdb                           590G  140G  420G  25% /home
tmpfs                              3.2G  4.0K  3.2G   1% /run/user/1000
root@elastic-1:/var/log/elasticsearch# systemctl status elasticsearch
× elasticsearch.service - Elasticsearch
     Loaded: loaded (/lib/systemd/system/elasticsearch.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Thu 2025-02-06 09:05:01 -05; 22min ago
       Docs: https://www.elastic.co
    Process: 65270 ExecStart=/usr/share/elasticsearch/bin/systemd-entrypoint -p ${PID_DIR}/elasticsearch.pid --quiet (code=exited, status=70)
   Main PID: 65270 (code=exited, status=70)
        CPU: 3.544s
root@elastic-1:/var/log/elasticsearch# systemctl stop elasticsearch
root@elastic-1:/var/log/elasticsearch# systemctl start elasticsearch
Job for elasticsearch.service failed because the control process exited with error code.
See "systemctl status elasticsearch.service" and "journalctl -xeu elasticsearch.service" for details.

Typically that happens when the kernel sees some kind of fundamental error with the storage, often a hardware failure, so it flips the filesystem into read-only mode to try and limit the damage. If so, there'll be more details in the kernel logs, and really the only sensible fix is to replace the failing drive.
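When the kernel remounts a filesystem read-only it logs the reason first, so the kernel ring buffer and journald are the places to look. A minimal sketch of the checks (the grep patterns and the device/mount names are illustrative and should be adapted to your system):

```shell
# Search the kernel ring buffer for I/O and filesystem errors
dmesg -T | grep -iE 'i/o error|ext4-fs error|read-only' | tail -n 20

# Same information from journald, limited to kernel messages from this boot
journalctl -k -b | grep -iE 'remount.*read-only|i/o error' | tail -n 20

# List mounts the kernel currently has flagged read-only
findmnt -rn -o TARGET,OPTIONS | awk '$2 ~ /(^|,)ro(,|$)/'
```

If the drive is failing, you would typically see lines like `EXT4-fs (sdb): Remounting filesystem read-only` preceded by block-layer I/O errors.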


Hello, thank you very much for replying.

Fortunately I had a snapshot of the virtual machine from the previous day, and restoring it fixed the problem.

I will keep your comments in mind in case the error occurs again, so I can inform the administrator of the virtual machines.

@juancamiloll Good that you were able to recover.

But unless you understand what went wrong here, there is a good chance it will happen again. If so, before restoring the VM to a previous snapshot again, please look at the logs on the VM, and likely on the host, to see what happened.

++ Yes, don't ignore this until it happens again: marking a filesystem as read-only is usually a reaction to a detectable error, but flaky hardware also produces undetectable errors.

Also, don't use VM snapshots to restore Elasticsearch nodes to an earlier state. It will lead to silent data loss and all sorts of other weirdness. See these docs:

Taking a snapshot is the only reliable and supported way to back up a cluster. [...] You must use the built-in snapshot functionality for cluster backups.
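For reference, the built-in backup mechanism mentioned in those docs is the `_snapshot` API. A sketch of registering a repository and taking a snapshot; the repository name, the path, and the unauthenticated `localhost:9200` endpoint are all illustrative and must match your own setup (the location has to be listed under `path.repo` in elasticsearch.yml on every node, and secured clusters need credentials on these calls):

```shell
# Register a shared-filesystem snapshot repository (name and path are examples)
curl -X PUT "localhost:9200/_snapshot/my_backup" \
  -H 'Content-Type: application/json' \
  -d '{"type": "fs", "settings": {"location": "/mnt/backups/my_backup"}}'

# Take a snapshot of the cluster and wait for it to complete
curl -X PUT "localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true"

# Restoring later goes through the same API, e.g.:
# curl -X POST "localhost:9200/_snapshot/my_backup/snapshot_1/_restore"
```

Unlike a VM snapshot, this captures a consistent point-in-time view of the indices that Elasticsearch knows how to restore safely.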


@DavidTurner @RainTown

Thank you very much for your advice. Honestly, I didn't know about the snapshots issue and did it unintentionally. Right now I have the problem again: one of the cluster servers is down and the other two are still working.

What do you recommend me to do then?

I'd suggest speaking with the folks that run your infrastructure. Something is very very wrong if filesystems are being marked readonly.

Absolutely, and I guarantee there are logs on both the VM itself, and likely on the host, telling you, maybe in a non-obvious way, why.

Some examples (it can be something entirely different): the same block device being presented to multiple VMs; some characteristic of the device being changed outside the VM; a storage-full error somewhere (over-provisioned storage); some process within the VM writing to the block device, bypassing the filesystem; the same at the level above; ...
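To help narrow those possibilities down, a few read-only checks from inside the VM are worth capturing before any restore. Device and mount names below are illustrative, and `smartctl` often reports nothing useful for virtual disks:

```shell
# ext4 superblock state ("clean" vs "clean with errors") and recorded error info
tune2fs -l /dev/sdb | grep -iE 'filesystem state|error'

# Whether the kernel currently has /home mounted read-only
findmnt -no OPTIONS /home

# SMART health of the disk, if the hypervisor passes it through to the VM
smartctl -H /dev/sdb || true

# Only once the underlying cause is fixed: unmount, check, then remount
# umount /home && fsck -f /dev/sdb && mount /home
```

If `tune2fs` shows errors but the host-side storage looks healthy, that points at the virtualization layer rather than the physical drive.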

But this is a system issue, not an Elasticsearch issue. Good luck.

unintentionally I already did it, at this moment I have the problem again

I rate it unlikely that you did something wrong here to create this issue. But it's not clear what you mean by "I already did it".