ElasticSearch for ETL - question about from-to requests and Scroll API

I am planning to use ElasticSearch for ETL. Airflow tasks are configured to read data from ElasticSearch in batches, process the data and load it back to Elastic.

The problem is that processing takes a long time, so I split it into several tasks that run in parallel.

I am using Elasticsearch-DSL for Python, and as far as I understand, when I use:

from elasticsearch_dsl import Search
s = Search(client=CLIENT, index=INDEX)
s = s[1000:1050]
results = s.execute()

it is as computationally expensive for ES to get this result as it is to just get the first 1050 objects.
Hence the error I get:

Result window is too large, from + size must be less than or equal to: X but was Y

I've increased max_result_window, but AFAIU that is not recommended (I would need to set it to around 10 million).

Another approach that I found is the Scroll API. However, if I want to parallelize my tasks, the first task needs to get the Scroll ID and pass it to the second task to get the next batch, which then passes it to the third, and so on. That makes parallelization difficult.

Is there an efficient way to use ElasticSearch for ETL if I need to process all of the data in an index using several parallel workers?

Thank you

Have you seen sliced scrolls? They seem like they would help in your case.


Thank you, it seems like a good solution.

However, I've failed to find this feature in the official Python libraries (elasticsearch, elasticsearch-dsl). Anyway, I could use the low-level API to run a sliced scroll.

Thank you

You would just specify it in the body of a search, like the query parameter. That isn't explicitly mentioned in the docs either; see https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.Elasticsearch.search
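
To make that concrete, here is a minimal sketch of one worker reading its own slice. It assumes a 7.x-style elasticsearch-py client whose search() accepts body, scroll and size, and whose scroll() and clear_scroll() accept scroll_id. The helper names slice_body and scroll_slice, the batch size and the keep-alive value are made up for illustration and are not part of any library.

```python
def slice_body(slice_id, max_slices, query=None):
    """Build a search body restricted to one slice of a sliced scroll."""
    # Assumption: Elasticsearch expects max >= 2 and 0 <= id < max.
    if max_slices < 2 or not 0 <= slice_id < max_slices:
        raise ValueError("need max_slices >= 2 and 0 <= slice_id < max_slices")
    return {
        "slice": {"id": slice_id, "max": max_slices},
        "query": query or {"match_all": {}},
    }


def scroll_slice(client, index, slice_id, max_slices,
                 batch_size=1000, keep_alive="5m"):
    """Yield every hit of one slice; each parallel worker runs one call."""
    resp = client.search(
        index=index,
        body=slice_body(slice_id, max_slices),
        scroll=keep_alive,   # keeps the scroll context alive between batches
        size=batch_size,     # hits per batch, not the total
    )
    scroll_id = resp.get("_scroll_id")
    try:
        # An empty batch means this slice is exhausted.
        while resp["hits"]["hits"]:
            yield from resp["hits"]["hits"]
            resp = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = resp.get("_scroll_id", scroll_id)
    finally:
        # Release the scroll context on the server when done or on error.
        if scroll_id:
            client.clear_scroll(scroll_id=scroll_id)
```

Each Airflow task would then call something like scroll_slice(CLIENT, INDEX, slice_id=i, max_slices=N) with its own i, so no Scroll ID has to be passed between tasks. With elasticsearch-dsl, the same slice can probably be attached via Search.extra(slice={"id": i, "max": N}) before calling scan(), though I have not verified that against every version.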

