When Elasticsearch snapshot deletes fail partway through, some dangling indices are left behind, which increases the size of the repository.
I want to find such indices by reading the snapshot metadata files. I need help parsing these files: they are SMILE-serialized, and the Jackson Java library parser does not work on them.
If a failure leaves any dangling data then it is cleaned up the next time a snapshot is deleted. If you don't regularly delete any snapshots you can trigger the cleanup process manually.
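If it helps, something along these lines should trigger it. This is only a rough sketch using the low-level Java REST client; the host and the "my-repository" name are placeholders you would need to adjust:

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class TriggerRepositoryCleanup {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // POST _snapshot/<repository>/_cleanup asks the repository to remove stale data
            // that is no longer referenced by any existing snapshot.
            Response response = client.performRequest(new Request("POST", "/_snapshot/my-repository/_cleanup"));
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}
```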
It should be possible to read the SMILE-formatted portion of the metadata files with Jackson or similar, although the format does change from version to version so it's not so simple. But what would you do with the contents once you've read it? You shouldn't be deleting any blobs from the repository yourself.
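For what it's worth, a SMILE-capable reader is just Jackson's ObjectMapper built on a SmileFactory. Here's a minimal sketch, assuming jackson-dataformat-smile is on the classpath. Bear in mind that the metadata blobs are written via ChecksumBlobStoreFormat, which adds a checksum header and footer around the SMILE payload, so feeding a blob to Jackson byte-for-byte won't work, which is probably why your attempt failed:

```java
import java.nio.file.Files;
import java.nio.file.Paths;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.smile.SmileFactory;

public class ReadSmile {
    public static void main(String[] args) throws Exception {
        // An ObjectMapper backed by a SmileFactory reads bare SMILE byte streams.
        ObjectMapper smileMapper = new ObjectMapper(new SmileFactory());
        byte[] payload = Files.readAllBytes(Paths.get(args[0]));
        JsonNode root = smileMapper.readTree(payload);
        System.out.println(root.toPrettyString());
    }
}
```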
Hey!
I have tried running the cleanup process manually, but it still leaves a huge discrepancy in size.
For example, I have a cluster that reports 2 TB of data, while the snapshot repository backing it up still shows ~10 TB. I can't think of any other possible reason why this occurs.
After I've read the SMILE files, I'd check for any unreferenced index files and delete them (or mark them as extra).
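To illustrate the idea, this is a rough sketch of the detection step. It assumes a local filesystem copy of the repository (for our S3 repo the directory listing would be a ListObjects call instead), and it assumes the index-N blob is plain JSON with an "indices" object whose entries carry an "id" field, which is what I see in our repository:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class FindDanglingIndices {
    public static void main(String[] args) throws IOException {
        Path repoRoot = Paths.get(args[0]);   // local copy of the repository root
        Path indexN = Paths.get(args[1]);     // the current index-N blob, e.g. index-123

        // Collect the index UUIDs that the current RepositoryData still references.
        JsonNode repositoryData = new ObjectMapper().readTree(indexN.toFile());
        Set<String> referenced = new HashSet<>();
        repositoryData.path("indices").forEach(index -> referenced.add(index.path("id").asText()));

        // Anything under indices/ that no snapshot references any more is a candidate dangling index.
        try (Stream<Path> dirs = Files.list(repoRoot.resolve("indices"))) {
            dirs.filter(Files::isDirectory)
                .map(dir -> dir.getFileName().toString())
                .filter(uuid -> !referenced.contains(uuid))
                .forEach(uuid -> System.out.println("unreferenced index directory: indices/" + uuid));
        }
    }
}
```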
I couldn't read the files through Jackson, so I finally went through the entire codebase and used the deserialize method in the ChecksumBlobStoreFormat class.
But I couldn't figure out what the namedXContentRegistry parameter is for, and just copied it from a code segment. Could you please help me understand what this parameter is exactly?
This is the current NamedXContentRegistry I am using:

```java
// Combines the named X-Content entries contributed by the network, indices and cluster modules.
private static final NamedXContentRegistry xContentRegistry = new NamedXContentRegistry(
    Stream.of(
            NetworkModule.getNamedXContents().stream(),
            IndicesModule.getNamedXContents().stream(),
            ClusterModule.getNamedXWriteables().stream())
        .flatMap(Function.identity())
        .collect(toList()));
```
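And this is roughly how I call deserialize with it. Treat it as a sketch only: the local path and the "<uuid>" part of the blob name are placeholders, and I'm relying on the 7.13 BlobStoreRepository.SNAPSHOT_FORMAT constant.

```java
// Sketch only: reads one snap-<uuid>.dat blob that was copied out of the repository
// to a local path. Uses org.elasticsearch.repositories.blobstore.BlobStoreRepository,
// org.elasticsearch.snapshots.SnapshotInfo and org.elasticsearch.common.bytes.BytesArray.
static SnapshotInfo readSnapshotInfo(Path localBlob, String blobName) throws IOException {
    byte[] raw = Files.readAllBytes(localBlob);
    return BlobStoreRepository.SNAPSHOT_FORMAT.deserialize(blobName, xContentRegistry, new BytesArray(raw));
}
```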
Will this work for all the metadata and snap files used in the snapshots?
I recommend against doing that, if you get even one file wrong then you could render the whole repository unreadable. If you find that the cleanup process isn't cleaning everything up then please report it as a bug.
It's for plugins to add their own SMILE-serialized types, but I don't know that we use any such things in snapshot metadata so NamedXContentRegistry.EMPTY should be ok I believe.
> I recommend against doing that, if you get even one file wrong then you could render the whole repository unreadable. If you find that the cleanup process isn't cleaning everything up then please report it as a bug.
The code that I am working on requires that I do this. I will surely report this as a bug.
> It's for plugins to add their own SMILE-serialized types, but I don't know that we use any such things in snapshot metadata so NamedXContentRegistry.EMPTY should be ok I believe.
This is the 7.13 deserialize signature:

```java
public T deserialize(String blobName, NamedXContentRegistry namedXContentRegistry, BytesReference bytes)
```
Will NamedXContentRegistry.EMPTY work for the repository-s3 plugin as well?
If you do find a bug btw, it would help if you could share a copy of all the metadata blobs in your repo, plus a listing of the whole repo, before you've attempted any fixes. I can arrange a channel for you to share this info with me privately if needed.
Interesting, thanks for noting that. I don't have time to try and reproduce it right now but I also can't see a path through the cleanup code that would clean up such blobs. I opened an issue anyway, we'll get to it at some point, unless you open a PR to fix it first.
It's surely much simpler to just fix the existing API?
> Interesting, thanks for noting that. I don't have time to try and reproduce it right now but I also can't see a path through the cleanup code that would clean up such blobs. I opened an issue anyway, we'll get to it at some point, unless you open a PR to fix it first.
Aw, man! I really wanted to fix this issue with a PR; it would've been my first.
Is it possible to keep the issue open? I'll really get working on it in a month or so.
Sure, sorry, I meant that we likely won't be working on this in the near future; if you want to contribute then go ahead. Drop a message on the issue to say you're working on it. We won't close it until it's fixed or we decide we won't fix it.
It's unlikely that I'm the right person to help with that, sorry. I suggest opening a new thread with a more precise description of the problem you're having.