Read segments directly from the snapshot

Hello,

I'm trying to read Lucene files (segments) directly from the files generated in the snapshot, but I'm encountering errors. Could you guide me on how to do it?

Thank you.

Why do you want to do such a thing? It's definitely not supported and you should only use the Elasticsearch APIs.

Hello David,

We currently have clusters with very large indices and high traffic. Reindexing is not a viable option as it performs poorly in these cases: it slows down over time, stresses the cluster, and is frequently cancelled. Since each shard of the index is a file in the snapshot, the best approach seems to be to read the documents from there and insert them into a new index. I tried using the Lucene library, but I didn't succeed.

That isn't the case. Each shard is made up of many files in the snapshot repository.

This is what reindex does, so I don't think it'll work any better.

I'm not surprised, the only realistic way to read data from a snapshot repository is by restoring it into Elasticsearch.
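For context, the reindex being discussed is driven through the `_reindex` API. A minimal sketch of a request body follows; the index names and batch size are illustrative placeholders, not values from this thread:

```python
import json

# Sketch of a _reindex request body. "source-index" and "dest-index"
# are hypothetical names; 5000 is an example per-batch size.
reindex_body = {
    "source": {"index": "source-index", "size": 5000},
    "dest": {"index": "dest-index"},
}

# The request itself would be something like:
#   POST /_reindex?slices=auto&wait_for_completion=false
# i.e. run as a background task with automatic slicing, so a long-running
# reindex can be monitored and cancelled via the task management API.
print(json.dumps(reindex_body, indent=2))
```

Running it as a task rather than a blocking request matters for large indices, since it is the only practical way to track progress or cancel cleanly.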

That isn't the case. Each shard is made up of many files in the snapshot repository.

Where can I find this internal structure? Is it documented somewhere?

This is what reindex does, so I don't think it'll work any better.

That's true, it does work, but in a 40TB index, it's useless. As I mentioned before, in a high-traffic cluster, the reindexing process gets canceled due to errors and is slow.

I'm not surprised, the only realistic way to read data from a snapshot repository is by restoring it into Elasticsearch.

"Realistic way"? If Elasticsearch can do it, anyone can do it. That's why I'm asking for help here.

What is it you are looking to achieve by reindexing?

What type of load is the cluster under? Is it only queries or a mix of queries, inserts and updates?

How many shards does this index have? What is the average shard size?

Hi Christian,

Cluster information:

  • 150k RPM reads
  • 200k RPM writes
  • It's a mix of queries, inserts, and updates.
  • 32 shards, each with 1.9TB, a total of 15 billion documents.

What is it you are looking to achieve by reindexing?

The index is already too large, and we want to split it into multiple chunks.

Reindexing is going to take a long time whichever approach you take, and given that you have a lot of inserts and updates it might be difficult to ensure consistency across the two indices.

Have you looked into using the split index API? This should be a lot quicker and require less downtime. It does require a good amount of extra disk space, though, but it would be worth considering.
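For readers following along, the split index API involves two requests; a hedged sketch (the index names and shard counts here are illustrative, not from this thread):

```python
import json

# 1. The source index must be made read-only before it can be split.
#    PUT /my-index/_settings   ("my-index" is a hypothetical name)
make_read_only = {"settings": {"index.blocks.write": True}}

# 2. Split into a target index whose shard count is a multiple of the
#    source's, e.g. 32 -> 128.
#    POST /my-index/_split/my-split-index
split_body = {"settings": {"index.number_of_shards": 128}}

print(json.dumps(make_read_only))
print(json.dumps(split_body))
```

The extra disk space mentioned above comes from the split initially hard-linking or copying segment files into the new shards before unneeded documents are cleaned up.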

That's something we took into account. The problem is that the index was created with version 6.8 and the split feature cannot be used.

Yes, although it's pretty complex and not at all focussed on this kind of usage - see the org.elasticsearch.repositories.blobstore package summary in the Elasticsearch 8.11.1 javadoc (NB varies by version), and all the classes to which those docs link. But then you'll likely need various other customisations to Lucene to actually open these indices outside of ES, and I don't think this is documented well.

You're proposing doing the work in a separate process, isolated from this cluster, which seems like a reasonable workaround, but why not have this separate process be another Elasticsearch cluster dedicated to the reindexing work?
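One way to realise the "dedicated reindexing cluster" idea is reindex-from-remote: restore the snapshot into a separate cluster, then pull documents from it into the new indices. A sketch of the request body; the host and index names are assumptions for illustration, not from this thread:

```python
import json

# _reindex with a "remote" source: the dedicated cluster (restored from
# the snapshot) is read over HTTP, keeping the scan/scroll load off the
# production cluster. Host and index names are hypothetical.
remote_reindex_body = {
    "source": {
        "remote": {"host": "http://reindex-cluster:9200"},
        "index": "restored-index",
    },
    "dest": {"index": "new-index-chunk-1"},
}

# POST /_reindex   (issued against the cluster holding the destination index)
print(json.dumps(remote_reindex_body, indent=2))
```

Note that the remote host must be whitelisted in the destination cluster's `reindex.remote.whitelist` setting before such a request is accepted.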

It's true that anyone can in theory duplicate all the mechanisms in ES needed to read a snapshot, but I think it's an unrealistic amount of effort to do so.


Back when our app used Lucene embedded in it, we used Luke (GitHub: DmitryKey/luke, the Lucene Toolbox Project) to inspect the Lucene files directly. It has since become an official Lucene module. Reading TBs sounds like a lot of work. Good luck!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.