I think elasticsearch-hadoop uses the REST API to retrieve data. Does "index.max_result_window", which is 10k by default, take effect in batch job mode? My Spark job needs to retrieve and analyze all of the data in this case, and the index could have 1M docs.
index.max_result_window only applies to regular search requests to Elasticsearch (it caps from + size), and is there mostly as a safeguard against problems that can occur when doing deep pagination. Instead of paginating data through the regular search API, ES-Hadoop uses the Scroll API, which creates a longer-lived search context and exports the results out of Elasticsearch over the course of multiple requests. In this case, the documents are returned in their natural internal document order, so no sorting is required. The 10k limit does not get in the way of reading all 1M docs.
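To make the scroll flow concrete, here is a minimal sketch in Python of what a scroll loop looks like at the REST level. It is not ES-Hadoop's actual code, only an illustration of the pattern. The `post` callable is a hypothetical transport (send a JSON POST to a path, get back the parsed response), and `make_fake_post` is an in-memory stand-in I made up so the example runs without a cluster. The two request paths and the `scroll`, `scroll_id`, `size`, and `sort: ["_doc"]` fields are the ones the Scroll API uses.

```python
from typing import Callable, Dict, Iterator, List

# Hypothetical transport: post(path, body) sends a JSON POST and returns the parsed response.
Post = Callable[[str, Dict], Dict]


def scroll_all(post: Post, index: str, page_size: int = 1000,
               keep_alive: str = "1m") -> Iterator[Dict]:
    """Yield every hit in `index` by following a scroll cursor."""
    # Open the search context. Sorting on _doc asks for index order, the cheapest order.
    page = post(f"/{index}/_search?scroll={keep_alive}",
                {"size": page_size, "sort": ["_doc"]})
    while page["hits"]["hits"]:
        yield from page["hits"]["hits"]
        # Each follow-up passes the latest scroll_id and renews the keep-alive.
        page = post("/_search/scroll",
                    {"scroll": keep_alive, "scroll_id": page["_scroll_id"]})


def make_fake_post(docs: List[Dict]) -> Post:
    """In-memory stand-in for a cluster (illustration only, not a real client)."""
    state = {"pos": 0, "size": 0}

    def post(path: str, body: Dict) -> Dict:
        if path != "/_search/scroll":
            # Initial search request: remember the page size and start from the top.
            state["pos"], state["size"] = 0, body["size"]
        start = state["pos"]
        state["pos"] = start + state["size"]
        return {"_scroll_id": "fake-cursor",
                "hits": {"hits": docs[start:start + state["size"]]}}

    return post
```

With the fake transport, `list(scroll_all(make_fake_post(docs), "my-index"))` walks past 10k documents page by page, because no single request ever asks for a deep `from` offset. Against a real cluster you would also want to clear the scroll context when done. In ES-Hadoop itself none of this is written by hand, the connector handles it for you.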