Shared File System - Repository Issues

Hello, hope you're all doing well!

For the last few years we have been using the Snapshot & Restore feature over a shared file system repository mounted via SMB, and everything seemed to work well.

We did some restores earlier this year using a repository named "Snapshots_2" at "/snapshots2", which resided on a TrueNAS.

This month we started to have a couple of issues. I'll break down our actions below for better understanding.

  1. Earlier this month (August 2025), we upgraded to a new TrueNAS, keeping every single disk (~500 TB of data) from the previous TrueNAS.
  2. For some unknown reason, once we reattached this repository on the new TrueNAS, Elasticsearch would "verify" the repository successfully (see the sketch after this list), although it was not possible to read/restore any of the snapshots.
  3. We then started to debug the situation and could not find a proper reason. We recreated the repository in the "Repositories" section of Snapshot & Restore and rebooted the TrueNAS.
  4. With that, once the repository had been recreated, Elasticsearch itself started a "repository cleanup" process that took approximately 24 hours to finish.
  5. A team member then started the ILM and some snapshots were taken (1581 to be precise).
  6. However, I managed to find out that the older snapshots were not accessible, and that we currently have two index-N files under the /snapshots2 folder (one related to the older snapshots, with 116750 snapshots, and one related to the "1581" ones).

(screenshot: the two index-N files under /snapshots2)

  7. We recreated the repository again in the "Repositories" section of Snapshot & Restore, and we are now reading the older snapshots (the ones under the 116750 index-N).
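
For reference, the "verify" in step 2 was triggered through the UI, so the exact request is an assumption, but it should be roughly equivalent to the standard repository verification API below. As far as we understand, this call only checks that the cluster nodes can access and write to the repository location; it does not examine the contents of existing snapshots, which would explain why it succeeded even though the snapshots could not be read:

POST /_snapshot/Snapshots_2/_verify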

These snapshots under the 116750 index-N are the ones that worry us, because they amount to about 500 TB of data and are throwing some errors:

a. We can list the snapshots successfully with:

GET /_snapshot/Snapshots_2/all-1hour-365days-2025.07.17-19:40-gmjbuqyyqtovppizh3whew?verbose=false

Response:

{
    "snapshots": [
        {
            "snapshot": "all-1hour-365days-2025.07.17-19:40-gmjbuqyyqtovppizh3whew",
            "uuid": "abbFdVX9Qp-Gr7cimlai6Q",
            "repository": "Snapshots_2",
            "indices": [
                ".ds-winlogbeat-siem-ds-2025.07.16-000818"
            ],
            "data_streams": [],
            "state": "SUCCESS"
        }
    ],
    "total": 1,
    "remaining": 0
}

b. We cannot restore, due to a missing index metadata file:

Our POST:

/_snapshot/Snapshots_2/all-1hour-365days-2025.07.17-19:40-gmjbuqyyqtovppizh3whew/_restore

{
  "indices": ".ds-winlogbeat-siem-ds-2025.07.16-000818",
  "rename_pattern": "(.+)",
  "rename_replacement": "restored-$1"
}

Response:

{
    "error": {
        "root_cause": [
            {
                "type": "snapshot_missing_exception",
                "reason": "[Snapshots_2:all-1hour-365days-2025.07.17-19:40-gmjbuqyyqtovppizh3whew/abbFdVX9Qp-Gr7cimlai6Q] is missing"
            }
        ],
        "type": "snapshot_missing_exception",
        "reason": "[Snapshots_2:all-1hour-365days-2025.07.17-19:40-gmjbuqyyqtovppizh3whew/abbFdVX9Qp-Gr7cimlai6Q] is missing",
        "caused_by": {
            "type": "no_such_file_exception",
            "reason": "/snapshots2/indices/A1EcVWcYSseOKpbBVdXzEg/meta-aVpiwJcBY7800yEqmipQ.dat"
        }
    },
    "status": 404
}

c. In fact, we cannot find /snapshots2/indices/A1EcVWcYSseOKpbBVdXzEg/meta-aVpiwJcBY7800yEqmipQ.dat on disk. However, we can see the folder /snapshots2/indices/A1EcVWcYSseOKpbBVdXzEg/, with /0 and /1 subfolders containing some snap*.dat and other files.
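
For clarity, here is a rough sketch of the on-disk layout as we observe it, using the paths from the error above (the annotations reflect our understanding and may not be exact):

/snapshots2/
├── index-N                                  <- root file listing the snapshots in the repository
└── indices/
    └── A1EcVWcYSseOKpbBVdXzEg/              <- one folder per snapshotted index
        ├── meta-aVpiwJcBY7800yEqmipQ.dat    <- index metadata (this is the file that is missing)
        ├── 0/                               <- shard 0: snap*.dat and data blobs (present)
        └── 1/                               <- shard 1: snap*.dat and data blobs (present)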

Now we are looking to identify:

  • What caused/is causing this behaviour?
  • Is there any way to restore these snapshots/indices?

It is also important to mention that these indices belong to a data stream named winlogbeat-siem-ds, and that, due to the time it takes, we have not yet run _verify_repository, but we are starting it shortly.

Any kind of help would be precious, as we will very likely need to do further restores.

Looking forward to hearing from you, and thank you so much.

I don’t think anything in Elasticsearch could have caused this file to go missing, at least not without some other serious misadventure caused by external forces. Elasticsearch would delete this file when deleting the corresponding snapshot, but it does that after successfully updating the root index-${N} file to stop referring to the snapshot. Yet the snapshot exists in your root file, so either something else deleted the file, or else Elasticsearch deleted the snapshot (including this file) and then something reinstated it in index-${N} later on.

I don’t think so, sorry. The information in the lost file is vital, and does not exist elsewhere.

Since 8.16.0 Elasticsearch has had a verify repository integrity API which will scan the contents of your repository looking for problems like this. As the docs say:

If you suspect the integrity of the contents of one of your snapshot repositories, cease all write activity to this repository immediately, set its read_only option to true, and use this API to verify its integrity.
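
In practice, for this repository, that procedure would look roughly like the sketch below. The repository name and location are taken from this thread; the remaining settings are illustrative assumptions, and note that for a shared file system repository the read-only flag is spelled readonly in the settings, if I recall correctly:

PUT /_snapshot/Snapshots_2
{
  "type": "fs",
  "settings": {
    "location": "/snapshots2",
    "readonly": true
  }
}

POST /_snapshot/Snapshots_2/_verify_integrity

The second call runs the integrity verification, which scans the repository contents and reports anomalies such as snapshots that refer to missing blobs.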


Thank you so much for your fast answer!

Is it possible that the actions taken between points 4 and 6 could cause this behaviour? I see in the Elasticsearch logs that a cleanup did run. (With ILM I meant SLM.) Here is an example Elasticsearch cleanup log event:

index B7dNOnfWQL2KJ2uY3WGbqQ is no longer part of any snapshot in the repository, but failed to clean up its index folder
java.nio.file.FileSystemException: /snapshots2/indices/B7dNOnfWQL2KJ2uY3WGbqQ/0/__sL4_QhykQT6O-LXagInLQw: Resource temporarily unavailable
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:100) ~[?:?]
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)

It is important to note that the cleanup ran while index-1581 was the one being "read".

It is also important to mention that this behaviour applies across every snapshot under the 116750 index-N.

We have loads of snapshots containing loads of indices with this behaviour.

Regarding the integrity verification suggestion:

We created a new cluster for that purpose, added this repo, set it to read-only, and started the _verify_integrity. It might take a long time for 500 TB of repository :confused:

An additional question: is there any easier/faster way to identify the "clean" snapshots available for restoration? And is it possible to remove the info related to index-1581 (since it is not being read right now)?

Thank you so much again!!

I don’t think so, no. Everything that Elasticsearch does to the repository will fail-safe: it only deletes files after it knows for sure that they are not needed. An error like Resource temporarily unavailable would halt the process.

No, sorry, you need to use the integrity verification API to determine this.

No, you must not ever modify the contents of a snapshot repository yourself.


Thank you so much.

The _verify_integrity did finish, and we have managed to restore some indices.

Some of them we will not be able to restore, and that led us to a theory.

I would like to know whether it makes any kind of sense, or whether it might be totally wrong:

In the past, we had issues with our old TrueNAS's performance, and SMB was regularly going down.

  1. On 07-16, an index named .ds-winlogbeat-siem-ds-2025.07.16-000817 was created at 00:00. SMB was fine for the whole day until 20:00.

  2. The existing snapshots that include the xxx817 index are able to restore it.

  3. On 07-16, an index named .ds-winlogbeat-siem-ds-2025.07.16-000818 was created at 21:00 (with SMB down), and the snapshots that ran afterwards got the following error:

    ElasticsearchException[failed to create blob container]; nested: FileSystemException[/snapshots2/indices/A1EcVWcYSseOKpbBVdXzEg: Host is down
    
  4. As mentioned earlier in this thread, the xxx818 index cannot be restored.

  5. On 07-17 we brought SMB back up and the snapshots ran successfully, although it looks like the meta-aVpiwJcBY7800yEqmipQ.dat related to xxx818 was never "created".

Our snapshots run hourly and cover 30 days of indices (a rough SLM sketch follows below).

We believe we are only able to restore indices that were created while the SMB share was fully functional. The ones created while SMB was down resulted in errors creating the blob container, and once SMB was restored, their blobs were never properly created.
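
For context, the hourly schedule corresponds roughly to an SLM policy like the one below. Only the schedule, repository name, and snapshot-name prefix are taken from this thread; the policy name, retention, and other options are illustrative assumptions rather than our real configuration (SLM also appends its own unique suffix to each snapshot name, which matches the random part of the names above, as far as I know):

PUT /_slm/policy/hourly-snapshots
{
  "schedule": "0 0 * * * ?",
  "name": "<all-1hour-365days-{now/d}>",
  "repository": "Snapshots_2",
  "config": {
    "indices": ["*"],
    "include_global_state": false
  },
  "retention": {
    "expire_after": "30d"
  }
}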

If Elasticsearch got an error writing one of these files then it wouldn’t get as far as writing a root index-${N} file pointing to it.

To get the kinds of effects you’re reporting would need the storage to have incorrectly reported to Elasticsearch that it had created the desired file(s) successfully.

Thank you for your help.

Since the only missing file is:

/snapshots2/indices/{uuid}/meta-{uuid}.dat

but we still have the shard data under:

  • /snapshots2/indices/{uuid}/0

  • /snapshots2/indices/{uuid}/1

is there any possible workaround to rebuild the index, or at least to retrieve the documents stored in these indices? Our current Elastic version is 8.18.2.

I don’t think that’s possible. The meta-{uuid}.dat contains the index metadata, without which the index data is meaningless.