Reindex using the scroll API


(ciro) #1

Hi, my goal is to reindex my documents.

I am running the latest version of Elasticsearch. The documentation says I need to use the scroll API and the bulk API (I know how to use both).

First question: can I use "search_type=scan" on the first search, or is it deprecated?
I saw this: https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking_21_search_changes.html

Second question: when I run a scroll search with the scroll id and I have fewer than 10 documents, I get no results (e.g. an empty hits[]), but if I have more than 10 documents I do get results. How is that possible? Is there a setting I need to change?

I'm doing a normal query:

POST /region/city/_search?scroll=1m
{
  "size": 1000,
  "sort": ["_doc"],
  "query": {
    "match_all": {}
  }
}


(Adrien Grand) #2

It is indeed deprecated. A scroll request that sorts on _doc, as in your example, is the right thing to do.

Maybe you are confused because a plain scroll already returns documents on the first request (unlike scans, which only counted results on the first request and returned hits only on subsequent calls to the scroll API).
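
To make that behaviour concrete, here is a minimal Python sketch of a scroll loop, not an official client. It assumes the JSON request-body form of the scroll endpoint; the helper names `post_json` and `scroll_hits`, the `post` parameter, and the base URL are hypothetical. The first response already carries hits, so when everything fits in the first batch, the next scroll call legitimately returns an empty hits[].

```python
import json
import urllib.request


def post_json(url, body):
    """POST a JSON body and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def scroll_hits(base_url, index, batch_size=1000, keep_alive="1m", post=post_json):
    """Yield every hit of a match_all query, one batch at a time.

    `post` is injectable (an assumption of this sketch) so the loop can be
    exercised without a running cluster.
    """
    # The initial search opens the scroll AND returns the first batch of hits.
    page = post(
        f"{base_url}/{index}/_search?scroll={keep_alive}",
        {"size": batch_size, "sort": ["_doc"], "query": {"match_all": {}}},
    )
    # An empty hits array means the scroll is exhausted.
    while page["hits"]["hits"]:
        yield from page["hits"]["hits"]
        page = post(
            f"{base_url}/_search/scroll",
            {"scroll": keep_alive, "scroll_id": page["_scroll_id"]},
        )
```

Usage would look like `for hit in scroll_hits("http://localhost:9200", "region"): ...`, where the host and index name are placeholders.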


(ciro) #3

Thanks for the answer.

Now I have a new question: if I run either the scroll API or a normal search against my alias with a "match_all" query and size set to the total number of documents in the index, I get the same result.

So in this case the scroll API seems useless, because I can run a normal query on the alias, save the data, and then use the bulk API to insert it into the new index.

If I'm on the right track, what is the benefit of using the scroll API for reindexing (as recommended in the ES documentation)?


(Adrien Grand) #4

The benefit is that a regular search operation is very bad at fetching lots of records at once. It might work in your case if you don't have many documents, but otherwise the fact that it needs to fetch all matching documents and put them in a single JSON document will likely make your system run out of memory.
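
As an illustration of why batching keeps memory bounded, the following Python sketch turns one batch of hits at a time into a bulk request body and hands it to a sender, so only a single batch is held in memory at any moment. The names `to_bulk_body`, `reindex_in_batches`, and `send_bulk` are hypothetical, and the sender (for example an HTTP POST to the `_bulk` endpoint) is left to the caller.

```python
import json


def to_bulk_body(hits, dest_index):
    """Build a newline-delimited bulk body that indexes each hit into dest_index."""
    lines = []
    for hit in hits:
        action = {"index": {"_index": dest_index, "_id": hit["_id"]}}
        # Carry the mapping type over when the hit has one.
        if "_type" in hit:
            action["index"]["_type"] = hit["_type"]
        lines.append(json.dumps(action))          # action/metadata line
        lines.append(json.dumps(hit["_source"]))  # document source line
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


def reindex_in_batches(batches, dest_index, send_bulk):
    """Send one bulk request per batch and return the number of documents sent.

    `batches` is any iterable of hit lists (e.g. successive scroll pages);
    each batch can be garbage-collected as soon as it has been sent.
    """
    total = 0
    for hits in batches:
        if not hits:
            continue
        send_bulk(to_bulk_body(hits, dest_index))
        total += len(hits)
    return total
```

Because `batches` can be a generator fed by scroll pages, peak memory depends on the batch size rather than on the total number of documents.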


(ciro) #5

Right, but to use the scroll API I need to run a large number of queries to fetch all the results, which can be slow; and before I can call _search/scroll I first need to run another query, _search?scroll=1m, to obtain the scroll_id (this seems inconsistent to me).

So I either have a large number of scroll queries or one big JSON document, and in both cases my system can run out of memory.

Am I missing something, or is that right?
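
For a rough sense of the trade-off raised here, the request count can be worked out with simple arithmetic. This small Python sketch assumes a scroll loop that stops on the first empty page and a `size` that applies to the whole batch, as in a plain scroll sorted on _doc; the function name is hypothetical. The number of requests grows with the document count, but each response never holds more than `size` hits.

```python
def scroll_request_count(total_docs, batch_size):
    """Number of HTTP requests a scroll needs, counting the final empty page."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    pages_with_hits = (total_docs + batch_size - 1) // batch_size  # ceiling division
    return pages_with_hits + 1  # plus the request that returns no hits


# One million documents in batches of 1000: many requests, small responses.
print(scroll_request_count(1_000_000, 1000))  # 1001
```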

