How to script export of > 10,000 records - 5 mil?

zoplex · July 20, 2016, 9:19pm

I need to batch export all records for one hour, one day, etc. Record counts are often well over 10,000. I cannot change the 10,000 limit on the size parameter for curl, and since this is automated, is there any option other than the scroll approach? Scroll seems complex for batching/scripting ...

Thanks,

nik9000 · July 20, 2016, 9:37pm

There isn't any option other than scroll. I bet some of the language clients, like Python, Perl, or Ruby, have scroll helpers that'd make it simpler.

jprante · July 20, 2016, 10:07pm

Why don't you use a combination of from/size and page through the result set?
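
One caveat with from/size paging, shown as a minimal Python sketch: by default Elasticsearch rejects any request where from + size exceeds index.max_result_window (10,000 by default), so paging alone only reaches the first 10,000 hits. The helper name page_bodies and the match_all query are illustrative assumptions, not part of any client library.

```python
# Default value of the index.max_result_window index setting.
DEFAULT_MAX_RESULT_WINDOW = 10000


def page_bodies(total, page_size, max_result_window=DEFAULT_MAX_RESULT_WINDOW):
    """Yield from/size search bodies for `total` hits.

    Stops as soon as from + size would exceed the result window,
    because Elasticsearch refuses such a request.
    """
    offset = 0
    while offset < total and offset + page_size <= max_result_window:
        # Hypothetical query; substitute the real one.
        yield {"from": offset, "size": page_size, "query": {"match_all": {}}}
        offset += page_size


def reachable_docs(total, page_size, max_result_window=DEFAULT_MAX_RESULT_WINDOW):
    """Count how many of `total` hits from/size paging can actually return."""
    return sum(
        min(page_size, total - body["from"])
        for body in page_bodies(total, page_size, max_result_window)
    )
```

With 5 million matching records and pages of 1,000, this produces only ten request bodies, which is why paging past that point needs either scroll or a larger result window.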

zoplex · July 20, 2016, 10:58pm

It needs to run in a batch file ... I assume what you are describing is interactive?

zoplex · July 20, 2016, 11:43pm

Is it possible to script it? Also, I tried a simple scroll call in ES 5 and got an error:

curl -v -X GET 'localhost:9200/filebeat-2016.07.20/_search?search_type=scan&scroll=1m' -d '{ "query": { "match_all": {} }, "size": 100000 }'
...

{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"No search type for [scan]"}],"type":"illegal_argument_exception","reason":"No search type for [scan]"},"status":400}

zoplex · July 21, 2016, 1:56am

Another problem is that the documentation indicates that scroll does not sort data, so if the data is needed sorted by, let's say, timestamp, the sorting would have to be done outside once the data is extracted:

https://www.elastic.co/guide/en/elasticsearch/guide/1.x/scan-scroll.html

...
"The costly part of deep pagination is the global sorting of results, but if we disable sorting, then we can return all documents quite cheaply. To do this, we use the scan search type. Scan instructs Elasticsearch to do no sorting, but to just return the next batch of results from every shard that still has results to return."

dadoonet · July 21, 2016, 6:13am

Read: Scroll | Elasticsearch Guide [2.3] | Elastic

Scroll requests have optimizations that make them faster when the sort order is _doc. If you want to iterate over all documents regardless of the order, this is the most efficient option:

curl -XGET 'localhost:9200/_search?scroll=1m' -d '
{
  "sort": [
    "_doc"
  ]
}
'

So you can scroll with any other sort criteria.

Note that you are not forced to use scan when you scroll. Scan was deprecated in 2.1 and removed in 5.0, which is why ES 5 rejects search_type=scan.
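
For scripting this from a batch job, a scroll loop is only a few lines. The following is a minimal Python sketch using only the standard library, assuming an unsecured node at a URL you pass in. The function names (http_post_json, scroll_export, write_ndjson) are made up for illustration. It sends the first search with ?scroll=1m, then keeps posting the returned _scroll_id to /_search/scroll until a batch comes back empty. Pass a different sort, for example on @timestamp (assuming that field exists in your index), if you need ordered output.

```python
import json
from urllib.request import Request, urlopen


def http_post_json(url, body):
    """POST a JSON body and return the decoded JSON response."""
    req = Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def scroll_export(base_url, index, query, batch_size=1000, keep_alive="1m",
                  sort=("_doc",), post=http_post_json):
    """Yield every hit matching `query`, fetching one scroll batch at a time.

    `post` is injectable so the loop can be exercised without a live cluster.
    """
    body = {"size": batch_size, "sort": list(sort), "query": query}
    page = post("%s/%s/_search?scroll=%s" % (base_url, index, keep_alive), body)
    while page["hits"]["hits"]:
        for hit in page["hits"]["hits"]:
            yield hit
        # Each response carries the scroll id to use for the next batch.
        page = post(
            base_url + "/_search/scroll",
            {"scroll": keep_alive, "scroll_id": page["_scroll_id"]},
        )


def write_ndjson(hits, out):
    """Write each hit's _source as one JSON line; return the number written."""
    count = 0
    for hit in hits:
        out.write(json.dumps(hit["_source"]) + "\n")
        count += 1
    return count
```

A nightly job could then call write_ndjson(scroll_export("http://localhost:9200", "filebeat-2016.07.20", {"match_all": {}}), f) with f an open output file. Here size is the per-batch size, not the total, so it stays well under the 10,000 limit no matter how many records match.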

zoplex · July 28, 2016, 5:13pm

Thank you all for the answers. I am also looking at changing the index setting, per a suggestion from a co-worker:

curl -XPUT "http://localhost:9200/myindex/_settings" -d '{ "index" : { "max_result_window" : 500000 } }'

While this has some impact on ES, if large extracts are run only at night, say once a day, to get data that feeds further analysis, it would probably be manageable.

Thank you

Dilip_Kumar · February 13, 2017, 10:17am

Hi zoplex

You can use this to increase the number of documents you can fetch:

curl -XPUT http://localhost:9200/indexname/_settings -d '{ "index" : { "max_result_window" : 1000000}}'
