Retrieving millions of large documents

Hello everyone, this is my first time requesting your help.

I am working with Elastic version 8.5.3 with Java client 7.17.1. Let me present the problem I'm having.

I have daily indices, the largest of which holds unique data documents. The number of documents is in the millions (about 15 million a day), with the size of the documents reaching tens of gigabytes.

For my use case the documents are timestamped and I need to acquire the ones between given times (a record-and-review-on-request sort of use case). I know this isn't the intended use of Elastic, but it's what we've got.

Until now I have used the pagination process to retrieve all the records, and it worked fine until the data set became too large. Because the web client uses the ping-pong method (scroll and return), it can't really be multithreaded or split across multiple identical services with load balancing.

I need to find a way to make the process quicker. There is also post-processing after the documents are retrieved.

For starters, I have reduced the size of the documents, and it helped, but I fear it won't be enough.

Thanks in advance

Sliced search is there to let you parallelize the retrieval of a large data set; it sounds like you want that.
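
For illustration, here is a minimal Python sketch of how a sliced search could be laid out. It only builds one request body per slice and runs a caller-supplied function for each slice in its own thread. The `@timestamp` field, the time window, the page size, and the `fetch_slice` callback (which would send the body to the search endpoint with a scroll or PIT and drain that slice) are all assumptions made for the example, not something from your setup.

```python
from concurrent.futures import ThreadPoolExecutor


def build_slice_bodies(query, num_slices, page_size=1000):
    """Return one search body per slice; the slices partition the matching hits."""
    if num_slices < 2:
        raise ValueError("slicing needs at least two slices")
    return [
        {
            "slice": {"id": slice_id, "max": num_slices},
            "query": query,
            "size": page_size,
            "sort": ["_doc"],  # index order, the cheapest sort for bulk retrieval
        }
        for slice_id in range(num_slices)
    ]


def run_slices(bodies, fetch_slice):
    """Call fetch_slice(body) for every slice in parallel; results come back in slice order."""
    with ThreadPoolExecutor(max_workers=len(bodies)) as pool:
        return list(pool.map(fetch_slice, bodies))


# Hypothetical one-hour window on an assumed "@timestamp" field.
time_window = {
    "range": {
        "@timestamp": {"gte": "2023-01-01T00:00:00Z", "lt": "2023-01-01T01:00:00Z"}
    }
}
bodies = build_slice_bodies(time_window, num_slices=4)
```

Each slice can then be handed to a separate worker or service instance, since the slices do not overlap.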

Is the PIT id the same as the scroll_id concept?

Similar, yes.
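
As a rough sketch of how paging with a PIT looks compared to a scroll: each page is an ordinary search whose body carries the PIT id plus a `search_after` value taken from the last hit of the previous page. The helper below is hypothetical and only builds the next request body from the previous response; it assumes the search was sorted, so every hit carries a `sort` array, and it takes the `pit_id` from the latest response instead of reusing the original one.

```python
def next_pit_body(previous_body, response, keep_alive="1m"):
    """Build the body for the next page, or return None when the last page was empty."""
    hits = response["hits"]["hits"]
    if not hits:
        return None
    body = dict(previous_body)  # shallow copy; only top-level keys are replaced
    # Always carry over the PIT id returned by the most recent response.
    body["pit"] = {"id": response["pit_id"], "keep_alive": keep_alive}
    # Continue right after the sort values of the last hit on the previous page.
    body["search_after"] = hits[-1]["sort"]
    return body
```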

If you have access to a Hadoop or Spark cluster, es-hadoop is another option.

I have a follow-up question for better understanding. While I'm not new to Elastic, I noticed something today that puzzled me and that I had not noticed before.

As I explained, my client sends a request that generates an original scroll_id. I then return this scroll_id to the client, which sends a different request that performs the scroll (and returns the results and the next scroll_id).

What seemed weird to me today is that the scroll_id remained the same. It worked as it should, but I wondered whether the scroll_id should change, or whether each query just updates the "pointer" behind the same scroll_id?

Just wondering.

No, it will (often) remain unchanged. But don't rely on that, treat it as different each time.
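
In code, that means the loop should take the id from the latest response on every round instead of caching the first one. A minimal sketch, assuming the usual response shape with `_scroll_id` and `hits.hits`; `next_page` is a hypothetical callback that would send the given scroll id to the scroll endpoint and return the parsed response:

```python
def drain_scroll(first_page, next_page):
    """Yield every hit, always passing on the scroll id from the latest response."""
    page = first_page
    while page["hits"]["hits"]:
        yield from page["hits"]["hits"]
        # Do not reuse an older id, even if it usually turns out to be identical.
        page = next_page(page["_scroll_id"])
```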
