How to parse a snapshot .dat file?

I want to store all of the documents added to shards since the last snapshot in a database table, and I don't want to restore an entire cluster for this. I've downloaded the snap-{uuid}.dat, meta-{uuid}.dat and relevant index files for the snapshot I want to get documents from, but I'm confused about how to access the individual documents inside the snap-{uuid}.dat. The file is unreadable and in some format I can't currently parse. Does anyone know what format this is and how I might be able to parse it? Thanks


Hi @Sheraz_Tariq

The snap-${uuid}.dat won't enable you to figure out the individual documents that were added for that snapshot, I'm afraid (more on that below). However, if you want to learn more about the specific format of the snapshot repository files and how they work together, you can read up on them here.

That said, it sounds to me like you want to restore just specific indices instead of the whole cluster? You can easily do this via the restore API; see its documentation here.
In your specific case you will want to use a request like:

POST /_snapshot/my_backup/snapshot_1/_restore
{
  "indices": "index_1,index_2",
  "ignore_unavailable": true,
  "include_global_state": true,
  "rename_pattern": "index_(.+)",
  "rename_replacement": "restored_index_$1"
}

That allows you to specify exactly which indices to restore, and to restore them under a different name if that helps you (via rename_pattern and rename_replacement). Technically, you could simply restore the two snapshots using different rename replacements and then compare the resulting indices to get the exact difference between them.
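
If you do go down that route, a rough sketch of the comparison step could look like the following. This is only an illustration: the index names and cluster address are made up, and it assumes the elasticsearch Python client is installed and that comparing document _id sets is enough for your purposes.

# diff_restored.py -- hypothetical sketch: compare the _id sets of two restored
# indices to find documents added or removed between the snapshots they came from.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch("http://localhost:9200")  # assumed cluster address

def all_ids(index):
    # Scroll over the whole index and collect every document _id.
    return {hit["_id"] for hit in scan(es, index=index)}

old_ids = all_ids("restored_index_1_old")   # illustrative restored index names
new_ids = all_ids("restored_index_1_new")

print("added:", new_ids - old_ids)
print("removed:", old_ids - new_ids)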

That should do what you're looking for right?

Thanks for your answer! What I actually want is every index that had any updates since the last snapshot, sort of the set difference of the two snapshots, but I want to avoid restoring the two snapshots to clusters and comparing them that way. Is there any way I can parse the files in the repository (such as the snap-{uuid}.dat, meta-{uuid}.dat and index files) to find the documents that were added/removed between two snapshots and write them to a text file? I can write a simple script to do that, but I'm not sure where to find the actual documents that were added/deleted (incrementally updated) since the last snapshot. Are they present in the files in the repository on S3, and if so, how can I parse those files? What is the format of the .dat file (e.g. what does the header look like), so I can parse it byte by byte?

Also if the documents are not in snap-{uuid}.dat file, then where are they located?


The answer here is no, there is no way to do that, I'm afraid.

The format of these files is potentially compressed SMILE with Lucene headers and footers around them. The code we use to read these is here if you are interested. Technically you could work from it to create your own parser for the file, but see below.

The difference between two snapshots is not directly in the snap-${uuid}.dat file. If you look at the documentation I linked you will find that the snap-${uuid}.dat blob (together with the latest index-N blob) in the repository merely defines which shards can be found in a snapshot.
You can then go to the shard folders, which themselves contain another snap-${uuid}.dat for each shard in the snapshot, listing all the files that were added for the snapshot in that shard. The files referenced by that snap-${uuid}.dat and named __${uuid} in the shard folders are the actual data files containing your documents.
The relationship between these individual files in a snapshot and the concrete documents they point at is not trivial though I'm afraid. The files are in Lucene format (documented here) and there is effectively no other way but restoring them to get their contents.
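
To make the layout a bit more concrete, a repository typically looks roughly like this (names are illustrative and the exact layout can differ between versions):

my_backup/
  index-42                  # current repository generation, lists all snapshots
  index.latest
  snap-<uuid>.dat           # top level: which indices/shards are in this snapshot
  meta-<uuid>.dat           # global cluster metadata for this snapshot
  indices/
    <index-uuid>/
      meta-<uuid>.dat       # index metadata
      0/                    # one folder per shard
        snap-<uuid>.dat     # lists the data files belonging to this snapshot
        __<uuid>            # actual Lucene data files for the shard
        __<uuid>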

Maybe you could restore the indices one by one (restoring both snapshots to different names), delete them again as you go, and compare contents that way to save resources?
There is in fact a shortcut that may help you here. Look at this API:

GET /_snapshot/my_backup/snapshot_1/_status

There is some sparse documentation for it here. What it returns is JSON that contains the number of files added for each shard. If you only added documents to a subset of indices, you can identify the indices that have not changed as those where the file size/counts under the key incremental are 0 for all shards; those won't need restoring, which could significantly simplify things. If you only added documents to a small number of indices this may be a viable path forward?
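
A rough sketch of that check, assuming the cluster is reachable over plain HTTP and that the response layout matches recent 7.x versions (repository name, snapshot name and exact field names are assumptions here, not guarantees):

# unchanged_indices.py -- hypothetical sketch: list indices in a snapshot whose
# shards all report 0 incremental files, i.e. indices that likely did not change.
import json
import urllib.request

URL = "http://localhost:9200/_snapshot/my_backup/snapshot_1/_status"  # assumed

with urllib.request.urlopen(URL) as resp:
    status = json.load(resp)

for index, data in status["snapshots"][0]["indices"].items():
    shards = data["shards"].values()
    # "incremental" holds the files/bytes actually added by this snapshot.
    if all(s["stats"]["incremental"]["file_count"] == 0 for s in shards):
        print(index, "unchanged since the previous snapshot")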

I see, this makes sense. I'm going to try and attempt to build a parser for the .dat files and see if that goes anywhere and then resort to restoring to cluster if that doesn't pan out.

On this line it says that the .dat files are SMILE serialized, so I'm wondering if I can use something like this to decode it? I've tried it and it gives me an invalid header error. Do you know what the header is for these files, so I can strip it and get to the body? Really appreciate your help.

You can simply assume the header to be the first 18 bytes and skip that to get the SMILE out of these files. Again, that doesn't really get you anywhere as far as getting hold of the actual data in the repository, though.
Outside of the Lucene file names in each snapshot, literally all the information you will get from these files is also returned by

GET /_snapshot/my_backup/snapshot_1/_status
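
If you still want to poke at the blob yourself, here is a minimal sketch based on the 18-byte header mentioned above. The header and footer lengths are assumptions of this example (not documented guarantees), the file name is made up, and you would still need a separate SMILE decoder for the extracted payload.

# extract_smile.py -- hypothetical sketch: strip the Lucene header/footer from a
# snap-<uuid>.dat blob so the remaining SMILE payload can be fed to a decoder.
HEADER_LEN = 18   # assumed header length, per the reply above
FOOTER_LEN = 16   # assumed Lucene CodecUtil footer (magic + algorithm id + checksum)

with open("snap-abc123.dat", "rb") as f:     # illustrative file name
    blob = f.read()

payload = blob[HEADER_LEN:len(blob) - FOOTER_LEN]

with open("snap-abc123.smile", "wb") as out:
    out.write(payload)
# The .smile file can then be decoded with any SMILE (Jackson binary JSON) decoder.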

Ah I see, I'm going to try and extract a parser out of the Elasticsearch code. I'm wondering if you can point me to where in the code the actual documents (or segments) are being copied (or restored)? I've looked at the RestoreService and CcrRepository where a multiFileTransfer is started, but beyond that things get hairy and I haven't been able to find exactly where the documents themselves are restored to the cluster. I'm thinking of just printing those documents to the console and working from there.

This is happening here.

This is impossible I'm afraid. Elasticsearch/Lucene is not a row store. The documents are stored in an inverted index, so you need to restore whole segments to get access to all the documents in them. The details of how things are stored are in the previously linked Lucene documentation, and I don't think there is a technical way to iterate over the difference between two snapshots that does not effectively restore parts of both snapshots to disk first.

@tanguy @Tanguy_Leroux Hey there! I saw your pull request here: https://github.com/elastic/elasticsearch/pull/49651. I'm wondering if you have any input on this?

Armin's reply above is absolutely correct. The only sensible way to find out the contents of a snapshot today is to restore it and search it. The relationship between the individual documents and the files stored by Lucene (documentation linked above) is very complicated. The work we're doing in the area of https://github.com/elastic/elasticsearch/issues/50999 may mean you will be able to avoid much of the work of the restore, but you will still need to search something to find the docs it contains.

@Sheraz_Tariq I agree with Armin and David here; the pull request you're pointing at won't really help you do what you want. You should consider a different strategy to detect documents added to indices since the last time you exported them to a database: for example, roll the indices over every time their docs are fully exported and start indexing new docs in a new index, so that docs added since the last snapshot always reside in a dedicated index.
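
As a rough sketch of that rollover step, assuming you index through a write alias (the alias name, cluster address and empty conditions body below are illustrative assumptions, not part of the suggestion above):

# rollover_after_export.py -- hypothetical sketch: after exporting the current
# docs to the database, roll the write alias over so new docs land in a fresh index.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:9200/my-write-alias/_rollover",   # assumed write alias
    data=b"{}",                                          # no conditions: always roll over
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))   # shows old_index / new_index created by the rollover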


I see. One other question I had: sometimes I'm able to make a GET request (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-get.html) even though the document is not searchable yet. Where is the data stored during this time? Is it in the transaction log, or is there a document store somewhere?

And also, generally speaking, when someone makes a GET request, is the entire document rebuilt from the inverted index, or is it retrieved from somewhere else?

Documents are first indexed in a Lucene index and then added to the transaction log. The transaction log is there in case of problems, so that operations can be replayed if needed.

You can read more about the transaction log and Lucene index here: Translog | Elasticsearch Guide [8.11] | Elastic

Starting from 7.6.0 (see https://github.com/elastic/elasticsearch/pull/48843), GET requests are served from the translog if the request has the realtime parameter set to true and the translog contains the document. The transaction log contains the full source and is trimmed automatically by Elasticsearch. If the translog does not contain the doc, the request is served from the Lucene index (and an internal refresh might be required in this case). The Lucene index possibly contains the full source of the document (in the _source field, if storing the source is enabled); otherwise it returns the fields that are explicitly marked as stored in the mapping.
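
To make that concrete, here is a small sketch of both variants of the GET document API over plain HTTP (the index name, document id and cluster address are made up):

# realtime_get.py -- hypothetical sketch: fetch a document with and without the
# realtime flag. realtime=true (the default) may serve the doc from the translog;
# realtime=false skips the translog lookup and reads only what the Lucene index exposes.
import json
import urllib.request

BASE = "http://localhost:9200/my-index/_doc/1"   # illustrative index and id

for realtime in ("true", "false"):
    with urllib.request.urlopen(f"{BASE}?realtime={realtime}") as resp:
        doc = json.load(resp)
    print(realtime, doc.get("_source"))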
