What is the best approach for streaming a large number of docs from ES 7.1: scrolling vs. slicing?

yash.tandon · December 2, 2019, 6:18am

Christian_Dahlqvist · December 2, 2019, 6:30am

Can you elaborate a bit more on the use case? How many documents are you retrieving? What do you do with this data? Are you indexing and/or updating concurrently? How frequently are you running this type of operation?

yash.tandon · December 2, 2019, 6:38am

I am retrieving approximately 1 million docs, and currently I am doing nothing but streaming the documents from ES. I also get a different number of docs each time I make a request, and I cannot figure out what I am doing wrong in the above code. For now I am looking for consistency in my results.

Christian_Dahlqvist · December 2, 2019, 6:40am

How are you intending to use this bulk retrieval?

The reason I am asking is that Elasticsearch is a search engine optimized for fast retrieval of smaller result sets. Scroll is the right choice for returning large data sets, but because it locks segments in order to get a consistent view (which impacts the ability to merge segments while a scroll is running), it is, as the docs describe, not suitable for real-time user requests.

Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration.
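To make the scroll pattern described above concrete, here is a minimal sketch of a scroll loop. It assumes a client object that exposes `search`, `scroll`, and `clear_scroll` with the keyword arguments used by the 7.x `elasticsearch` Python client; the function name `scroll_all` and the default page size and keep-alive are illustrative choices, not something from this thread.

```python
from typing import Any, Dict, Iterator


def scroll_all(client: Any, index: str, query: Dict[str, Any],
               page_size: int = 1000,
               keep_alive: str = "2m") -> Iterator[Dict[str, Any]]:
    """Yield every hit for `query`, one scroll page at a time.

    `client` is assumed to behave like elasticsearch.Elasticsearch (7.x).
    """
    # Sorting by _doc avoids scoring work, which suits a plain export.
    body = {"query": query, "sort": ["_doc"]}
    resp = client.search(index=index, body=body,
                         scroll=keep_alive, size=page_size)
    scroll_id = resp.get("_scroll_id")
    try:
        while resp["hits"]["hits"]:
            for hit in resp["hits"]["hits"]:
                yield hit
            # Use the most recent scroll id; it may change between pages.
            resp = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = resp.get("_scroll_id", scroll_id)
    finally:
        # Release the search context so it stops holding on to segments.
        if scroll_id:
            client.clear_scroll(scroll_id=scroll_id)
```

Because the function is a generator, only one page of hits is held in client memory at a time. The keep-alive only needs to cover the time spent processing a single page, not the whole export.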

yash.tandon · December 2, 2019, 6:48am

No, real-time user requests are not a concern for my use case. I just need to retrieve the existing docs in my cluster efficiently, without choking my cluster resources.

Christian_Dahlqvist · December 2, 2019, 6:51am

How large are your documents? What is the specification of your cluster? How many shards are you actively fetching from?

yash.tandon · December 2, 2019, 6:55am

Actually, I am using a parent-child relationship. I apply some conditions to my child docs and, for the child docs that qualify, I retrieve their parent docs using a has_child query. My child docs are big (approx. 40 fields), but the parent docs are small (5-10 fields).
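For readers unfamiliar with this query type, a request of the kind described here might look roughly like the sketch below. The relation name `child` and the `status` field are hypothetical placeholders, since the actual mapping and conditions are not shown in this thread.

```python
from typing import Any, Dict


def parents_with_matching_children(child_type: str,
                                   child_query: Dict[str, Any]) -> Dict[str, Any]:
    """Build a search body that returns parents whose children match."""
    return {
        "query": {
            "has_child": {
                "type": child_type,    # name of the child relation in the join field
                "query": child_query,  # condition evaluated against child docs
            }
        }
    }


# Hypothetical condition: children whose `status` field equals "qualified".
body = parents_with_matching_children("child", {"term": {"status": "qualified"}})
```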

Christian_Dahlqvist · December 2, 2019, 7:53am

How come you are using parent-child? Are the parent documents updated frequently?

Parent-child uses more memory than flat documents, but as I have never used them together with scroll queries I do not know how they interact.

yash.tandon · December 2, 2019, 9:40am

No, a document is never updated once it has been indexed. It is just that the number of docs to be retrieved is big, so I am running into issues related to heap and timeouts. I would therefore request you to please help me with scrolling vs. slicing.
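On the scrolling vs. slicing question: a sliced scroll is still a scroll, split into independent parts that can be consumed in parallel by adding a `slice` object (`id`, `max`) to the request body. The sketch below is one possible way to drive it, assuming a thread-safe client with the same `search` / `scroll` / `clear_scroll` methods as the 7.x `elasticsearch` Python client. The helper names, the `handle` callback, and the defaults are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict


def slice_body(query: Dict[str, Any], slice_id: int, max_slices: int) -> Dict[str, Any]:
    """Build the search body for one slice of a sliced scroll."""
    if max_slices < 2:
        raise ValueError("max_slices must be at least 2; use a plain scroll otherwise")
    if not 0 <= slice_id < max_slices:
        raise ValueError("slice_id must be in [0, max_slices)")
    return {"slice": {"id": slice_id, "max": max_slices},
            "query": query, "sort": ["_doc"]}


def drain_slice(client: Any, index: str, body: Dict[str, Any],
                handle: Callable[[Dict[str, Any]], None],
                page_size: int, keep_alive: str) -> int:
    """Scroll through one slice, pass each hit to `handle`, return the hit count."""
    resp = client.search(index=index, body=body, scroll=keep_alive, size=page_size)
    scroll_id = resp.get("_scroll_id")
    count = 0
    try:
        while resp["hits"]["hits"]:
            for hit in resp["hits"]["hits"]:
                handle(hit)
                count += 1
            resp = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = resp.get("_scroll_id", scroll_id)
    finally:
        # Free the search context for this slice.
        if scroll_id:
            client.clear_scroll(scroll_id=scroll_id)
    return count


def fetch_all_slices(client: Any, index: str, query: Dict[str, Any],
                     max_slices: int, handle: Callable[[Dict[str, Any]], None],
                     page_size: int = 1000, keep_alive: str = "2m") -> int:
    """Run one scroll per slice in parallel; `handle` is called from worker threads."""
    bodies = [slice_body(query, i, max_slices) for i in range(max_slices)]
    with ThreadPoolExecutor(max_workers=max_slices) as pool:
        counts = pool.map(
            lambda b: drain_slice(client, index, b, handle, page_size, keep_alive),
            bodies)
        return sum(counts)
```

The returned total can be compared with the `_count` API result for the same query to check whether every run retrieves the same number of docs. Keeping the number of slices at or below the number of shards is generally the cheaper configuration, and a smaller `page_size` trades more round trips for less heap pressure per request.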

system · December 30, 2019, 9:40am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
