ElasticSearch for ETL - question about from-to requests and Scroll API

KindYAK · August 28, 2019, 6:10am

I am planning to use Elasticsearch for ETL. Airflow tasks are configured to read data from Elasticsearch in batches, process the data and load it back into Elasticsearch.

The problem is that processing takes a long time, so I split it into several tasks that run in parallel.

I am using Elasticsearch-DSL for Python, and as far as I understand, when I use:

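from elasticsearch_dsl import Search
# CLIENT is an Elasticsearch client instance, INDEX is the index name
s = Search(using=CLIENT, index=INDEX)  # the keyword argument is `using`, not `client`
s = s[1000:1050]  # translates to from=1000, size=50
results = s.execute()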

It is as computationally expensive for ES to produce this result as it is to fetch the first 1050 objects.
Hence the error I get:

Result window is too large, from + size must be less than or equal to: X but was Y

I've increased max_result_window, but AFAIU that is not recommended (I would need to set it to around 10 million).

Another approach I found is the Scroll API. However, if I want to parallelize my tasks, the first task needs to get the scroll ID and pass it to the second task so it can fetch the next batch, which then passes it to the third, and so on. That makes parallelizing difficult.

Is there an efficient way to use Elasticsearch for ETL if I need to process all of the data in an index with several parallel workers?

Thank you

spinscale · August 28, 2019, 9:51am

Have you seen sliced scrolls? They seem to fit your case.

KindYAK · August 29, 2019, 4:31am

Thank you, it seems like a good solution.

However, I've failed to find this feature in the official Python libraries (elasticsearch, elasticsearch-dsl). Anyway, I could use the low-level API to do a sliced scroll.

Thank you

spinscale · August 29, 2019, 7:26am

You would just specify it in the body of a search, like the query parameter. It isn't explicitly mentioned in the docs either; see https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.Elasticsearch.search
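
For anyone landing here later, a minimal sketch of what "specify it in the body" could look like with the low-level elasticsearch-py client. The helper names (build_slice_body, scroll_slice), the batch size and the 2m scroll timeout are made up for illustration, and the body= keyword assumes a 7.x-style client (newer clients may warn about it). Each parallel worker would call scroll_slice with its own slice id, so no scroll ID has to be passed between tasks.

```python
from typing import Any, Dict, Iterator, Optional


def build_slice_body(slice_id: int, max_slices: int,
                     query: Optional[Dict[str, Any]] = None,
                     size: int = 1000) -> Dict[str, Any]:
    """Build the search body for one slice (hypothetical helper)."""
    # Elasticsearch requires at least 2 slices; ids run from 0 to max_slices - 1.
    if max_slices < 2:
        raise ValueError("max_slices must be at least 2")
    if not 0 <= slice_id < max_slices:
        raise ValueError("slice_id must be in [0, max_slices)")
    return {
        "slice": {"id": slice_id, "max": max_slices},
        "query": query or {"match_all": {}},
        "size": size,  # batch size per scroll page (assumed value)
    }


def scroll_slice(client, index: str, slice_id: int, max_slices: int,
                 scroll: str = "2m") -> Iterator[dict]:
    """Yield every hit belonging to one slice of a sliced scroll."""
    body = build_slice_body(slice_id, max_slices)
    resp = client.search(index=index, body=body, scroll=scroll)
    scroll_id = resp["_scroll_id"]
    try:
        # An empty page of hits means this slice is exhausted.
        while resp["hits"]["hits"]:
            yield from resp["hits"]["hits"]
            resp = client.scroll(scroll_id=scroll_id, scroll=scroll)
            scroll_id = resp["_scroll_id"]
    finally:
        # Release the scroll context on the server when done or on error.
        client.clear_scroll(scroll_id=scroll_id)
```

In an Airflow setup, each task could then be given a fixed (slice_id, max_slices) pair, for example worker 0 of 4 iterating over scroll_slice(client, INDEX, 0, 4). With elasticsearch-dsl, Search.extra(slice={"id": 0, "max": 4}) should add the same key to the request body, though I have not verified that against every version.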

