How to script an export of more than 10,000 records (up to 5 million)?


(zoplex) #1

I need to batch-export all records for a given period (one hour, one day, etc.). Record counts are often far above 10,000. I cannot change the 10,000 limit on the size parameter for the curl request, and since this is automated, is there any option other than the scroll approach? Scroll seems complex for batching/scripting ...

Thanks,


(Nik Everett) #2

There isn't any option other than scroll. I bet some of the language clients, like Python, Perl, or Ruby, have scroll helpers that'd make it simpler.


(Jörg Prante) #3

Why don't you use a combination of from/size and page through the result set?
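
For reference, a minimal sketch in Python of how from/size paging could be scripted. The function name page_windows is hypothetical. The 10,000 default is the index.max_result_window limit, and Elasticsearch rejects any request where from + size exceeds it. The sketch only computes the (from, size) pairs and sends no requests.

```python
def page_windows(total_hits, page_size, max_result_window=10000):
    """Yield (from, size) pairs that stay within index.max_result_window.

    Hypothetical helper: hits beyond max_result_window are not reachable
    with from/size, so the pairs stop at that boundary.
    """
    reachable = min(total_hits, max_result_window)
    for start in range(0, reachable, page_size):
        # The last page is shortened so that from + size never exceeds the window.
        yield start, min(page_size, reachable - start)
```

With 25,000 matching records and the default window, the pairs cover only the first 10,000 hits, so paging this way does not get past the limit unless max_result_window is raised.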


(zoplex) #4

It needs to run in a batch file ... I assume what you are describing is interactive?


(zoplex) #5

Is it possible to script it? Also, I tried a simple scroll call in ES 5 and got an error:

curl -v -X GET 'localhost:9200/filebeat-2016.07.20/_search?search_type=scan&scroll=1m' -d '{ "query": { "match_all": {} }, "size": 100000 }'
...

{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"No search type for [scan]"}],"type":"illegal_argument_exception","reason":"No search type for [scan]"},"status":400}


(zoplex) #6

Another problem is that the documentation indicates that scroll does not sort data, so if the data is needed sorted by, let's say, timestamp, then the sorting would have to be done outside once the data is extracted:

https://www.elastic.co/guide/en/elasticsearch/guide/1.x/scan-scroll.html

...
"The costly part of deep pagination is the global sorting of results, but if we disable sorting, then we can return all documents quite cheaply. To do this, we use the scan search type. Scan instructs Elasticsearch to do no sorting, but to just return the next batch of results from every shard that still has results to return."


(David Pilato) #7

Read: https://www.elastic.co/guide/en/elasticsearch/reference/2.3/search-request-scroll.html

Scroll requests have optimizations that make them faster when the sort order is _doc. If you want to iterate over all documents regardless of the order, this is the most efficient option:

curl -XGET 'localhost:9200/_search?scroll=1m' -d '
{
  "sort": [
    "_doc"
  ]
}
'

So you can scroll with any other sort criteria.

Note that you are not forced to use scan when you scroll. Scan was deprecated in 2.1 and removed in 5.0.
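
To script this, the scroll loop could look roughly like the following Python sketch. It uses only the standard library. The names http_post_json and scroll_all are made up for illustration, and the follow-up request format (a JSON body with scroll and scroll_id sent to /_search/scroll) is assumed to match the scroll API of 2.x and later. The body can carry "sort": ["_doc"] as above, or any other sort clause.

```python
import json
import urllib.request


def http_post_json(url, body):
    """POST a JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def scroll_all(base_url, index, body, keep_alive="1m", post=http_post_json):
    """Yield every hit matching `body`, fetching one scroll page at a time.

    `post` is injectable (an assumption of this sketch) so the loop can be
    exercised without a running cluster.
    """
    # The first request opens the scroll context and returns the first page.
    page = post(f"{base_url}/{index}/_search?scroll={keep_alive}", body)
    while page["hits"]["hits"]:
        yield from page["hits"]["hits"]
        # Each follow-up sends the latest scroll id and renews the keep-alive.
        page = post(
            f"{base_url}/_search/scroll",
            {"scroll": keep_alive, "scroll_id": page["_scroll_id"]},
        )
    # An empty page means the result set is exhausted.
```

Iterating over scroll_all("http://localhost:9200", "filebeat-2016.07.20", {"size": 1000, "sort": ["_doc"]}) and writing each hit's _source as one JSON line would give a batch export that can run unattended. The size here is the page size per request, not the total, so it can stay well below 10,000.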


(zoplex) #8

Thank you all for the answers. I am also looking at changing the index setting, per a suggestion from a co-worker:

curl -XPUT "http://localhost:9200/myindex/_settings" -d '{ "index" : { "max_result_window" : 500000 } }'

While this has some impact on ES, if large extracts run only at night, say once a day, to get data that feeds further analysis, it would probably be manageable.

Thank you


(Dilip Kumar) #9

Hi zoplex

You can use this to increase the number of documents that can be fetched:

curl -XPUT http://localhost:9200/indexname/_settings -d '{ "index" : { "max_result_window" : 1000000}}'

