I think elasticsearch-hadoop uses the REST API to retrieve data. Does "index.max_result_window", which is 10k by default, take effect in batch job mode? My Spark job needs to retrieve and analyze all of the data in this case. The index could have 1M docs.
index.max_result_window only applies to regular search requests, where it caps from + size, and it is there mostly as a safeguard against problems that can occur when doing deep pagination. Instead of paginating data through the regular search API, ES-Hadoop uses the Scroll API, which creates a longer-lived search context and exports the results out of Elasticsearch over the course of multiple requests. In this case, the documents are returned in their natural internal document order, so no sorting is required. The 10k limit therefore should not stop your Spark job from reading the whole index.
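To make the difference from from/size paging concrete, here is a minimal sketch of a scroll loop. It is not ES-Hadoop's actual code. The two callables are hypothetical stand-ins for the initial `POST /<index>/_search?scroll=1m` request (sorted by `_doc`) and the follow-up `POST /_search/scroll` requests, and `FakeScrollBackend` is an invented in-memory substitute for a cluster so the example can run on its own.

```python
from typing import Callable, Dict, Iterator, List


def scroll_all(
    start_search: Callable[[], Dict],
    next_page: Callable[[str], Dict],
) -> Iterator[Dict]:
    """Yield every hit by following scroll pages until one comes back empty.

    start_search: hypothetical wrapper around the initial scroll search.
    next_page:    hypothetical wrapper around the scroll continuation call;
                  it receives the scroll id from the previous response.
    """
    page = start_search()
    while page["hits"]["hits"]:
        yield from page["hits"]["hits"]
        page = next_page(page["_scroll_id"])


class FakeScrollBackend:
    """In-memory stand-in for Elasticsearch (assumption, for illustration only)."""

    def __init__(self, docs: List[Dict], page_size: int) -> None:
        self.docs = docs
        self.page_size = page_size
        self.offset = 0      # position of the simulated search context
        self.requests = 0    # number of round trips made so far

    def _page(self) -> Dict:
        self.requests += 1
        hits = self.docs[self.offset:self.offset + self.page_size]
        self.offset += len(hits)
        return {"_scroll_id": "fake-scroll-id", "hits": {"hits": hits}}

    def start_search(self) -> Dict:
        return self._page()

    def next_page(self, scroll_id: str) -> Dict:
        return self._page()


# 25,000 documents: more than the default 10k result window.
backend = FakeScrollBackend([{"_id": str(i)} for i in range(25_000)], page_size=1_000)
docs = list(scroll_all(backend.start_search, backend.next_page))
```

Each request only asks for "the next page" of an open search context, so no request ever carries a from + size value that could exceed the window, no matter how many documents are read in total.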