From Kibana I started seeing load errors, and when I checked the cluster I saw that one of the servers was down. I connected over SSH: the server itself was up and the elasticsearch service was running, but when I restarted the service it did not come back up.
When I check the logs I see something about a read-only mode. The disk is not full, and honestly I have no idea how to proceed in this case.
[2025-02-06T08:46:44,276][WARN ][o.e.c.a.s.ShardStateAction] [elastic-1] node closed while execution action [internal:cluster/shard/failure] for shard entry [FailedShardEntry{shardId [[.ds-ilm-history-7-2024.12.16-000060][0]], allocationId [oGr2qhxGQpK308BVu54asw], primary term [0], message [shard failure, reason [lucene commit failed]], markAsStale [true], failure [java.nio.file.FileSystemException: /home/data/elasticsearch/indices/9mK-GH_sQOyyipyf_naC-w/0/index/pending_segments_3t: Read-only file system
Typically that happens when the kernel sees some kind of fundamental error with the storage, often a hardware failure, so it flips the filesystem into read-only mode to try and limit the damage. If so, there'll be more details in the kernel logs, and really the only sensible fix is to replace the failing drive.
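To confirm that this is what happened, you can check the mount state and the kernel logs on the affected VM. A minimal sketch, assuming a Linux guest with systemd (the `/home` data path is taken from the log excerpt above; adjust it to your layout):

```shell
# 1. Is the data filesystem currently mounted read-only? Look for "ro"
#    in the mount-options column of /proc/mounts.
awk '$2 ~ /^\/home/ {print $2, $4}' /proc/mounts

# 2. What did the kernel log when it remounted the filesystem?
#    Typical culprits: I/O errors, ext4/XFS journal aborts, SCSI
#    timeouts. (dmesg may need root depending on kernel.dmesg_restrict;
#    "|| true" just keeps the script going when grep finds nothing.)
dmesg -T 2>/dev/null | grep -iE 'read-only|remount|i/o error|journal abort' || true

# 3. On systemd hosts the journal also keeps kernel messages; check
#    the boot during which the failure happened (add -b -1 for the
#    previous boot if the VM has since been restarted).
journalctl -k --no-pager 2>/dev/null | grep -iE 'read-only|remount' || true
```

If the kernel logged an I/O error or a journal abort right before the remount, that points at the storage layer rather than at Elasticsearch.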
But, unless you understand what went wrong here, there is a good chance it will happen again. If so, before restoring the VM to a previous snapshot again, please look at the logs on the VM, and likely on the host, to see what has happened.
++ Yes, don't just ignore this until it happens again: marking a filesystem read-only is usually a reaction to a detectable error, but flaky hardware also produces undetectable errors.
Also, don't use VM snapshots to restore Elasticsearch nodes to an earlier state. It will lead to silent data loss and all sorts of other weirdness. See these docs:
Taking a snapshot is the only reliable and supported way to back up a cluster. [...] You must use the built-in snapshot functionality for cluster backups.
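For reference, the built-in snapshot workflow is just a couple of API calls. A sketch in Dev Tools console format, assuming a shared-filesystem repository (the repository name `my_backup` and the path `/mnt/es_backups` are placeholders; the path must be listed under `path.repo` in `elasticsearch.yml` on every node):

```
PUT _snapshot/my_backup
{
  "type": "fs",
  "settings": { "location": "/mnt/es_backups" }
}

PUT _snapshot/my_backup/snapshot_1?wait_for_completion=true
```

Restoring is then `POST _snapshot/my_backup/snapshot_1/_restore`. Unlike a VM snapshot, this is consistent across the whole cluster, which is why it's the only supported backup mechanism.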
Thank you very much for your advice. Honestly, I didn't know about the snapshots issue, and I had already done it unintentionally. Right now I have the problem again: one of the cluster servers is down and the other two are still working.
Absolutely, and I guarantee there are logs on both the VM itself, and likely on the host, telling you, maybe in a non-obvious way, why.
Some examples (it can be something entirely different):

- the same block device is being presented to multiple VMs,
- some characteristic of the device has been changed from outside the VM,
- there is a storage-full error somewhere underneath (over-provisioned/thin-provisioned storage),
- some process within the VM is writing to the block device directly, bypassing the filesystem, or the same thing one level up, on the host, ...
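A quick storage-layer health pass you can run on the affected VM (and, where you have access, on the hypervisor host) to narrow these down. A sketch, assuming Linux; the device names are examples and `tune2fs`/`smartctl` only apply if the filesystem is ext4 and the disk is a physical passthrough device:

```shell
# Which block devices exist, and how full are the filesystems?
# (Thin-provisioned storage can report free space in the guest while
# the backing store on the host is actually exhausted.)
df -h || true
lsblk 2>/dev/null || true

# ext4 keeps an error state in the superblock; "clean" is healthy,
# anything mentioning errors confirms the kernel saw corruption.
# /dev/sda1 is an assumed device name.
tune2fs -l /dev/sda1 2>/dev/null | grep -iE 'state|error' || true

# SMART health, if smartmontools is installed and the disk is not
# purely virtual:
smartctl -H /dev/sda 2>/dev/null || true
```

If everything in the guest looks clean, repeat the exercise on the host: the error that remounted the guest filesystem may only be visible there.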
But this is a system issue, not an Elasticsearch issue. Good luck.
unintentionally I already did it, at this moment I have the problem again
I rate it unlikely you did something wrong here to create this issue. But it's not clear what you mean by "I already did it".