What is the best approach for streaming a large number of docs from ES v7.1: scrolling vs. slicing?

Can you elaborate a bit more on the use case? How many documents are you retrieving? What do you do with this data? Are you indexing and/or updating concurrently? How frequently are you running this type of operation?

I am retrieving approximately 1 million docs, and currently I am doing nothing but streaming the documents from ES. I also get a variable number of docs each time I make a request, and I cannot figure out what I am doing wrong with the above code. For now I am looking for consistency in my results.

How are you intending to use this bulk retrieval?

The reason I am asking is that Elasticsearch is a search engine optimized for fast retrieval of smaller result sets. Scroll is the right choice for returning large data sets, but because it locks segments in order to get a consistent view (which impacts the ability to merge segments while a scroll is running), it is, as the docs describe, not suitable for real-time user requests.

Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration.
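For reference, here is a minimal sketch of a plain scroll loop in Python. It assumes a client object shaped like the elasticsearch-py 7.x client, i.e. one exposing `search`, `scroll` and `clear_scroll`. The helper name `iter_scroll` and the `2m` keep-alive are illustrative choices, not something taken from this thread.

```python
def iter_scroll(client, index, body, keep_alive="2m"):
    """Yield every hit matching `body`, one scroll page at a time.

    `client` is assumed to behave like an elasticsearch-py 7.x client.
    """
    # The first request opens the scroll context and returns the first page.
    page = client.search(index=index, body=body, scroll=keep_alive)
    scroll_id = page.get("_scroll_id")
    try:
        while page["hits"]["hits"]:
            for hit in page["hits"]["hits"]:
                yield hit
            # Each follow-up call renews the keep-alive and fetches the next page.
            page = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            # Always continue with the most recently returned scroll ID.
            scroll_id = page.get("_scroll_id", scroll_id)
    finally:
        # Clear the scroll context so the cluster can release its resources
        # without waiting for the keep-alive to expire.
        if scroll_id is not None:
            client.clear_scroll(scroll_id=scroll_id)
```

Because this is a generator, only one page of hits is held in memory on the client at a time. The page size is whatever `size` is set to in `body`.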

No, real-time user requests are not a concern for my use case. I just need to retrieve the existing docs in my cluster efficiently, without choking my cluster resources.

How large are your documents? What is the specification of your cluster? How many shards are you actively fetching from?

Actually, I am using a parent-child relationship. I apply some conditions to my child docs, and for the child docs that qualify, I retrieve their parent docs using a has_child query. My child docs are big (approx. 40 fields), but the parent docs are small (5-10 fields).
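A request body for this kind of retrieval could look roughly like the sketch below. The child type name, the field names and the page size are hypothetical placeholders, since the actual mapping is not shown in this thread. Sorting by `_doc` is the order the Elasticsearch docs recommend for scroll requests when the order of results does not matter.

```python
def parents_with_matching_children(child_type, child_query, parent_fields, page_size=1000):
    """Build a search body that returns parents having at least one matching child.

    All argument values are supplied by the caller; nothing here is taken
    from the real mapping.
    """
    return {
        "size": page_size,          # hits per scroll page
        "_source": parent_fields,   # fetch only the parent fields that are needed
        "sort": ["_doc"],           # cheapest order when iterating over all hits
        "query": {
            "has_child": {
                "type": child_type,
                "query": child_query,
            }
        },
    }


# Hypothetical example: child relation "child", filtered on a made-up "status" field.
body = parents_with_matching_children(
    child_type="child",
    child_query={"term": {"status": "active"}},
    parent_fields=["name", "created_at"],
)
```

Restricting `_source` and keeping the page size moderate are the two knobs in this body that most directly limit how much data each response carries.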

How come you are using parent-child? Are the parent documents updated frequently?

Parent-child uses more memory than flat documents, but as I have never used them together with scroll queries, I do not know how they interact.

No, a document, once indexed, is never updated. It is just that the number of docs to be retrieved is big, so I am running into issues related to heap and timeouts. I would therefore ask you to please help me with scrolling vs. slicing.
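On the mechanics of slicing: a sliced scroll is an ordinary scroll request whose body carries an extra `slice` object with an `id` and a `max`, so that each slice returns a disjoint part of the result set and can be consumed independently. A minimal sketch of building the per-slice bodies follows. The helper name `sliced_bodies` is made up for illustration, and whether slicing actually eases the heap and timeout problems here is not something this sketch can establish.

```python
import copy


def sliced_bodies(body, max_slices):
    """Return one search body per slice, each a copy of `body` plus a `slice` clause."""
    # Elasticsearch requires `max` to be greater than 1.
    if max_slices < 2:
        raise ValueError("max_slices must be at least 2")
    bodies = []
    for slice_id in range(max_slices):
        sliced = copy.deepcopy(body)  # leave the caller's body untouched
        sliced["slice"] = {"id": slice_id, "max": max_slices}
        bodies.append(sliced)
    return bodies


# Example: split a match_all scroll into three independent slices.
base = {"size": 1000, "sort": ["_doc"], "query": {"match_all": {}}}
slices = sliced_bodies(base, 3)
```

Each body is then sent as its own scroll request, sequentially or from separate workers. The Elasticsearch docs advise keeping the number of slices at or below the number of shards to avoid extra memory cost on the first request of each slice.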
