I am planning to use ElasticSearch for ETL. Airflow tasks are configured to read data from ElasticSearch in batches, process the data and load it back to Elastic.
The problem is that processing takes a lot of time and I split processing into several tasks which are ran in parallel.
I am using Elasticsearch-DSL for Python, and as far as I understand, when I use:
from elasticsearch_dsl import Search
s = Search(client=CLIENT, index=INDEX)
s = s[1000:1050]
results = s.execute()
It's as computationally expensive for ES to get this result, as to just get first 1050 objects.
Hence, the error I get -
Result window is too large, from + size must be less than or equal to: X but was Y
I've increased the max_result_window, but AFAIU it is not recommended (I need to set it to around 10 million)
Another approach that I found is using Scroll API. However, if I want to parallelize my tasks, the first task needs to get Scroll ID, pass it to the second task to get the next batch, then pass to the third, etc., which makes it difficult
Is there an efficient approach to use ElasticSearch for ETL, if I need to process all of the data in an index using several parallel workers?