I need to batch export all records for one hour, one day, etc. Record counts are often > 10,000 (much more than that), and I cannot change the 10,000 limit on the size parameter for curl. Since this is automated, is there any other option besides the scroll approach? Scroll seems complex for batching/scripting... and the old scan search type is gone:
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"No search type for [scan]"}],"type":"illegal_argument_exception","reason":"No search type for [scan]"},"status":400}root@node01:/app_2/query
Another problem is that the documentation indicates scroll does not sort data, so if the data is needed sorted by, say, timestamp, the sorting has to be done outside ES once the data is extracted (sketched further down):
...
"The costly part of deep pagination is the global sorting of results, but if we disable sorting, then we can return all documents quite cheaply. To do this, we use the scan search type. Scan instructs Elasticsearch to do no sorting, but to just return the next batch of results from every shard that still has results to return."
"Scroll requests have optimizations that make them faster when the sort order is _doc. If you want to iterate over all documents regardless of the order, this is the most efficient option:"
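Applied to the sketch above, that just means adding the `_doc` sort to the opening request (same hypothetical index and timestamp field):

```bash
# Same opening request as above, but sorted by _doc: Elasticsearch skips
# scoring and global sorting and just streams each shard's docs in turn.
curl -s -XPOST "http://localhost:9200/myindex/_search?scroll=1m" \
  -H 'Content-Type: application/json' \
  -d '{ "size": 10000, "sort": ["_doc"], "query": { "range": { "@timestamp": { "gte": "now-1h" } } } }'
```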
While this has some impact on ES, if large extracts are run only at night, say once a day, to produce the data that feeds further analysis, it would probably be manageable.
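And since the scroll output above is one JSON document per line, the external sort by timestamp can be a one-liner too. A sketch assuming `jq`, an ISO 8601 `@timestamp` (which sorts correctly as text), and the export.json file from the loop above; note that `-s` slurps the whole extract into memory, which is fine for a nightly batch but not for arbitrarily large files:

```bash
# Re-sort the extracted documents by timestamp outside Elasticsearch.
# -s reads all input lines into a single array before sorting.
jq -s -c 'sort_by(._source["@timestamp"])[]' export.json > export_sorted.json
```

For extracts too big to hold in memory, prefixing each line with its timestamp and piping through GNU sort would do the same job streaming from disk.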