How to parse SMILE-serialized files (.dat files with snap-* and meta-* prefixes)

When Elasticsearch snapshot deletions fail partway through, some dangling indices are left behind, which increases the size of the repository.

I want to find such indices by referring to the snapshot metadata files. I need help parsing these files: they are SMILE-serialized, and the Jackson Java library's parser does not work with them.

Any help would be greatly appreciated.

If a failure leaves any dangling data then it is cleaned up the next time a snapshot is deleted. If you don't regularly delete any snapshots you can trigger the cleanup process manually.
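For reference, the manual trigger is the repository cleanup API, `POST /_snapshot/<repository>/_cleanup`. The following is a minimal sketch of calling it from Java with the JDK's built-in HTTP client. The host `http://localhost:9200` and the repository name `my_repository` are placeholders, the class and method names are made up for illustration, and authentication and TLS are left out.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class RepositoryCleanup {
    // Builds POST <baseUrl>/_snapshot/<repository>/_cleanup with an empty body.
    static HttpRequest cleanupRequest(String baseUrl, String repository) {
        URI uri = URI.create(baseUrl + "/_snapshot/" + repository + "/_cleanup");
        return HttpRequest.newBuilder(uri)
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder host and repository name; adjust both for your cluster.
        HttpRequest request = cleanupRequest("http://localhost:9200", "my_repository");
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // Print the status code and the raw JSON response body.
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

The response body should report how many blobs and bytes were removed, which is a quick way to see whether the cleanup found anything at all.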

It should be possible to read the SMILE-formatted portion of the metadata files with Jackson or similar, although the format does change from version to version so it's not so simple. But what would you do with the contents once you've read it? You shouldn't be deleting any blobs from the repository yourself.
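As a rough illustration of what "the SMILE-formatted portion" means: to my understanding, these blobs wrap the payload in a Lucene codec header (magic number, codec name, version) and a 16-byte checksum footer, so a plain SMILE parser pointed at the whole file will fail. The sketch below strips that wrapper using only the JDK. The class and method names are hypothetical, the layout is an assumption based on Lucene's `CodecUtil`, and the checksum is not verified. The payload may also be compressed, depending on repository settings and version, in which case it will not start with the SMILE magic bytes `:)\n`.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class SnapshotBlobPayload {
    static final int CODEC_MAGIC = 0x3fd76c17; // magic at the start of a Lucene codec header
    static final int FOOTER_LENGTH = 16;       // footer magic (4) + algorithm id (4) + checksum (8)

    // Returns the bytes between the codec header and the footer (assumed layout).
    static byte[] extractPayload(byte[] blob) {
        // Smallest possible blob: magic + name length + version + footer.
        if (blob.length < 4 + 1 + 4 + FOOTER_LENGTH) {
            throw new IllegalArgumentException("blob too short");
        }
        ByteBuffer in = ByteBuffer.wrap(blob); // big-endian by default
        if (in.getInt() != CODEC_MAGIC) {
            throw new IllegalArgumentException("missing codec header magic");
        }
        int nameLength = in.get(); // codec names are short, so the length fits in one byte
        int payloadStart = in.position() + nameLength + 4; // skip codec name and version int
        int payloadEnd = blob.length - FOOTER_LENGTH;
        if (nameLength < 0 || payloadStart > payloadEnd) {
            throw new IllegalArgumentException("unexpected header layout");
        }
        return Arrays.copyOfRange(blob, payloadStart, payloadEnd);
    }

    // SMILE documents start with the three bytes ':' ')' '\n' followed by a flags byte.
    static boolean looksLikeSmile(byte[] payload) {
        return payload.length >= 4
                && payload[0] == ':' && payload[1] == ')' && payload[2] == '\n';
    }
}
```

If `looksLikeSmile` returns true, the extracted bytes could then be handed to a SMILE-aware parser such as Jackson's `SmileFactory` from the `jackson-dataformat-smile` module; if it returns false, the payload probably needs decompressing first.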

Hey!
I have tried running the cleanup process manually, but it still leaves huge discrepancies in size.
For example, I have a cluster that reports a size of 2 TB, while the repository backup for it still shows ~10 TB. I can't think of any other reason why this would occur.

After I've read the SMILE files, I'd check for any unreferenced index files and delete them (or mark them as extra).
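The check described here boils down to a set difference. A minimal, report-only sketch is below. It assumes both inputs have already been collected: the directory names listed under the repository's indices path, and the index IDs referenced by the snapshot metadata. The class and method names are hypothetical, and nothing is deleted.

```java
import java.util.Set;
import java.util.TreeSet;

final class DanglingIndexReport {
    // Returns the directory names that appear in the listing but are not
    // referenced by any snapshot, in sorted order. Neither input is modified.
    static Set<String> unreferenced(Set<String> listedIndexDirs, Set<String> referencedIndexIds) {
        Set<String> dangling = new TreeSet<>(listedIndexDirs);
        dangling.removeAll(referencedIndexIds);
        return dangling;
    }
}
```

Keeping this step report-only makes it possible to review the candidates, or attach them to a bug report, before anything in the repository is touched.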

I couldn't read the files through Jackson, so I finally went through the entire codebase and used the deserialize method in the ChecksumBlobStoreFormat class.

But I couldn't figure out what the namedXContentRegistry parameter is for, so I just copied it from a code segment. Could you please help me understand what exactly this parameter is?

This is the namedXContentRegistry I am currently using:

private static final NamedXContentRegistry xContentRegistry = new NamedXContentRegistry(
    Stream.of(
            NetworkModule.getNamedXContents().stream(),
            IndicesModule.getNamedXContents().stream(),
            ClusterModule.getNamedXWriteables().stream())
        .flatMap(Function.identity())
        .collect(toList()));

Will this work for all the metadata and snap files used in the snapshots?

I recommend against doing that: if you get even one file wrong then you could render the whole repository unreadable. If you find that the cleanup process isn't cleaning everything up then please report it as a bug.

It's for plugins to add their own SMILE-serialized types, but I don't think we use any such things in snapshot metadata, so I believe NamedXContentRegistry.EMPTY should be OK.

> I recommend against doing that: if you get even one file wrong then you could render the whole repository unreadable. If you find that the cleanup process isn't cleaning everything up then please report it as a bug.

The project I am working on requires me to do this. I will certainly report this as a bug.

> It's for plugins to add their own SMILE-serialized types, but I don't know that we use any such things in snapshot metadata so NamedXContentRegistry.EMPTY should be ok I believe.

This is the 7.13 deserialize function:

public T deserialize(String blobName, NamedXContentRegistry namedXContentRegistry, BytesReference bytes){

Will NamedXContentRegistry.EMPTY work for the repository-s3 plugin as well?

I think so. It's worth trying; you'll get an exception if it doesn't work.

If you do find a bug btw, it would help if you could share a copy of all the metadata blobs in your repo, plus a listing of the whole repo, before you've attempted any fixes. I can arrange a channel for you to share this info with me privately if needed.

Alright, I'll get back to you. Please allow me some time to confirm whether it's a bug or something else.

Thanks for all the help. :slight_smile:

Hey @DavidTurner !
Turns out it isn't a bug. The cleanup script just doesn't clean enough.

I did write a program for an enhanced version of cleanup, which I like to call deepclean xD, and I need some help converting it into a plugin.

So I want to register an API like this:

POST deepclean
{
    "somefield": "somevalue",
    "anotherfield": "anothervalue"
}

So far I have created a catAction, but I can't read the attached JSON data.
Can you help me with this?

I think that counts as a bug. Rather than a new cleanup API, let's fix the one that's already there. Can you help us understand what it's missing?

Can you help me with the JSON data first? Really sorry about this, but I am in a time crunch :sweat_smile:.
It would be really helpful to me.

Interesting, thanks for noting that. I don't have time to try and reproduce it right now but I also can't see a path through the cleanup code that would clean up such blobs. I opened an issue anyway, we'll get to it at some point, unless you open a PR to fix it first.

It's surely much simpler to just fix the existing API?

I need it for a custom project, and I need to build it in a week, so I really need to write my own plugin. It also contains some other custom things.

> Interesting, thanks for noting that. I don't have time to try and reproduce it right now but I also can't see a path through the cleanup code that would clean up such blobs. I opened an issue anyway, we'll get to it at some point, unless you open a PR to fix it first.

Aw, man! I really wanted to fix this issue with a PR; it would have been my first.
Is it possible to close the issue? I'll get working on it in a month or so :sweat_smile:

If you have time, I'd really love some input on how to read the JSON data.

Sure, sorry, I meant that we likely won't be working on this in the near future. If you want to contribute then go ahead: drop a message on the issue to say you're working on it. We won't close it until it's fixed or we decide we won't fix it.

Okay! I'll get working on it.

Can you please help me with the JSON data?

It's unlikely that I'm the right person to help with that, sorry. I suggest opening a new thread with a more precise description of the problem you're having.