Read segments directly from the snapshot

Hello,

I'm trying to read Lucene files (segments) directly from the files generated in the snapshot, but I'm encountering errors. Could you guide me on how to do it?

Thank you.

Why do you want to do such a thing? It's definitely not supported and you should only use the Elasticsearch APIs.

Hello David,

We currently have clusters with very large indices and high traffic. Reindexing is not a viable option as it performs poorly in these cases: it slows down over time, stresses the cluster, and is frequently cancelled. Since each shard of the index is a file in the snapshot, the best approach seems to be to read the documents from there and insert them into a new index. I tried using the Lucene library, but I didn't succeed.

That isn't the case. Each shard is made up of many files in the snapshot repository.

This is what reindex does, so I don't think it'll work any better.

I'm not surprised, the only realistic way to read data from a snapshot repository is by restoring it into Elasticsearch.
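For context, the reindex being discussed is driven through the `_reindex` API. A minimal sketch of a request body follows; the index names and batch size are illustrative placeholders, not values from this thread:

```python
import json

# Sketch of a _reindex request body. "source-index" and "dest-index"
# are hypothetical names; 5000 is an example per-batch size.
reindex_body = {
    "source": {"index": "source-index", "size": 5000},
    "dest": {"index": "dest-index"},
}

# The request itself would be something like:
#   POST /_reindex?slices=auto&wait_for_completion=false
# i.e. run as a background task with automatic slicing, so a long-running
# reindex can be monitored and cancelled via the task management API.
print(json.dumps(reindex_body, indent=2))
```

Running it as a task rather than a blocking request matters for large indices, since it is the only practical way to track progress or cancel cleanly.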

That isn't the case. Each shard is made up of many files in the snapshot repository.

Where can I find this internal structure? Is it documented somewhere?

This is what reindex does, so I don't think it'll work any better.

That's true, it does work, but in a 40TB index, it's useless. As I mentioned before, in a high-traffic cluster, the reindexing process gets canceled due to errors and is slow.

I'm not surprised, the only realistic way to read data from a snapshot repository is by restoring it into Elasticsearch.

"Realistic way"? If Elasticsearch can do it, anyone can do it. That's why I'm asking for help here.

What is it you are looking to achieve by reindexing?

What type of load is the cluster under? Is it only queries or a mix of queries, inserts and updates?

How many shards does this index have? What is the average shard size?

Hi Christian,

Cluster information:

  • 150k RPM reads
  • 200k RPM writes
  • It's a mix of queries, inserts, and updates.
  • 32 shards, each with 1.9TB, a total of 15 billion documents.

What is it you are looking to achieve by reindexing?

The index is already too large, and we want to split it into multiple chunks.

Reindexing is going to take a long time whichever approach you take, and given that you have a lot of inserts and updates it might be difficult to ensure consistency across the two indices.

Have you looked into using the split index API? This should be a lot quicker and require less downtime. It does require a good amount of extra disk space, though, but it would be worth considering.
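For readers following along, the split index API involves two requests; a hedged sketch (the index names and shard counts here are illustrative, not from this thread):

```python
import json

# 1. The source index must be made read-only before it can be split.
#    PUT /my-index/_settings   ("my-index" is a hypothetical name)
make_read_only = {"settings": {"index.blocks.write": True}}

# 2. Split into a target index whose shard count is a multiple of the
#    source's, e.g. 32 -> 128.
#    POST /my-index/_split/my-split-index
split_body = {"settings": {"index.number_of_shards": 128}}

print(json.dumps(make_read_only))
print(json.dumps(split_body))
```

The extra disk space mentioned above comes from the split initially hard-linking or copying segment files into the new shards before unneeded documents are cleaned up.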

That's something we took into account. The problem is that the index was created with version 6.8 and the split feature cannot be used.

Yes, although it's pretty complex and not at all focussed on this kind of usage - see the org.elasticsearch.repositories.blobstore package summary in the Elasticsearch 8.11.1 javadoc (NB varies by version), and all the classes to which those docs link. But then you'll likely need various other customisations to Lucene to actually open these indices outside of ES, and I don't think this is documented well.

You're proposing doing the work in a separate process, isolated from this cluster, which seems like a reasonable workaround, but why not have this separate process be another Elasticsearch cluster dedicated to the reindexing work?
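One way to realise the "dedicated reindexing cluster" idea is reindex-from-remote: restore the snapshot into a separate cluster, then pull documents from it into the new indices. A sketch of the request body; the host and index names are assumptions for illustration, not from this thread:

```python
import json

# _reindex with a "remote" source: the dedicated cluster (restored from
# the snapshot) is read over HTTP, keeping the scan/scroll load off the
# production cluster. Host and index names are hypothetical.
remote_reindex_body = {
    "source": {
        "remote": {"host": "http://reindex-cluster:9200"},
        "index": "restored-index",
    },
    "dest": {"index": "new-index-chunk-1"},
}

# POST /_reindex   (issued against the cluster holding the destination index)
print(json.dumps(remote_reindex_body, indent=2))
```

Note that the remote host must be whitelisted in the destination cluster's `reindex.remote.whitelist` setting before such a request is accepted.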

It's true that anyone can in theory duplicate all the mechanisms in ES needed to read a snapshot, but I think it's an unrealistic amount of effort to do so.


Back when our app used Lucene embedded in it, we used Luke (GitHub: DmitryKey/luke, the Lucene Toolbox Project) to inspect the Lucene files directly. It has since become an official Lucene module. Reading TBs sounds like a lot of work. Good luck!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.