How to index docs using Scan and Scroll


(ganeshbabu) #1

Hi All,

I have gone through the scan and scroll documentation for Elasticsearch, which describes how to copy docs from one index to another, and I tried it with the following command.

POST /es_item/_search?scroll=10m&search_type=scan
{
"query": { "match_all": {}}
}

I got the following response in Sense:

{
"_scroll_id": "c2Nhbjs1OzE0Okp2WmZocUZLU2N2t0ck0tNV9UUEE7MTA6ODJqSEl4X3BTRENISVAwSE8xRzQtdzsxNjpKdlpmaHFGS1NjT09rdHJNLTVfVFBBOzExOjgyakhJeF9wU0RDSElQMEhPMUc0LXc7MTU6SnZaZmhxRktTY09Pa3RyTS01X1RQQTsxO3RvdGFsX2hpdHM6MTIxNjU2Ow==",
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 121656,
"max_score": 0,
"hits": []
}
}

When I then run the following GET command:

GET _search/scroll?scroll=1m&scroll_id=c2Nhbjs1Ozk6SnZaZmhxRktTY09Pa3RyTS01X1RQQTs2OjgyakhJeF9wU0RDSElQMEhPMUc0LXc7ODpKdlpmaGS1NjT09rdHJNLTVfVFBBOzc6ODJqSEl4X3BTRENISVAwSE8xRzQtdzsxMDpKdlpmaHFGS1NjT09rdHJNLTVfVFBBOzE7dG90YWxfaGl0czoxMjE2NTY7

I can see the list of items.

Sample output:

{
"_scroll_id": "c2Nhbjs1OzE3Okp2WmZocUZLU2NPT2t0ck0tNV9UUEE7MTI6ODJqSEl4X3BTRENISVAwSE8xRzQtdzsxOTpKdlpmaHFGS1NjT09rdHJNLTVfVFBBOzEzOjgyakhJeF9wU0RDSElQMEhPMUc0LXc7MTg6SnZaZmhxRktTY09Pa3RyTS01X1RQQTsxO3RvdGFsX2hpdHM6MTIxNjU2Ow==",
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 121656,
"max_score": 0,
"hits": [
{
"_index": "es_item",
"_type": "item",
"_id": "16636281",
"_score": 0,
"_source": {
"ITEM_ID": 16636281,
"ITEM_CODE": "16636281",
"ITEM_DSCR": null,
"ITEM_SPECIFICITY_REF_ID": 184,
"ITEM_TYPE": "STANDARD ITEM"
}
}]
}}

Can you give some detailed info or sample code showing how to index docs using the scroll ID?

It would be very helpful.

Thanks,
Ganeshbabu R


(Shane Connelly) #2

Sounds like you're trying to reindex some documents. You can't directly index/reindex docs from a search result, e.g. something contained in a scroll.

What you can do is perform the query/scan and, as you scroll through, take the _source of each hit and index it again, ideally through the bulk API. So the pseudocode for each page of the scroll becomes:

bulk_index = []
foreach hit in hits
  // modify hit if necessary, e.g. change the index it's going into
  bulk_index[] = hit
  if (bulk_index.length >= some_max)
    elasticsearch.index_bulk(bulk_index)
    bulk_index = []
  endif
endfor
// flush whatever is left over from the last partial batch
if (bulk_index.length > 0)
  elasticsearch.index_bulk(bulk_index)
endif
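
For what it's worth, here is roughly how that loop could look in Python. Treat it as a sketch under a few assumptions rather than a finished tool: fetch_page is a hypothetical callable that you supply (it would issue GET _search/scroll?scroll=1m&scroll_id=... and return the parsed JSON), the names iter_scroll_hits and bulk_bodies are made up for this example, and the default batch size of 500 is arbitrary. The two things it tries to show are that each scroll response hands back a new _scroll_id to use on the next request (stopping once a page has no hits), and that the bulk API takes newline-delimited JSON with one action line followed by one source line per document.

```python
import json
from typing import Callable, Dict, Iterable, Iterator, List


def iter_scroll_hits(fetch_page: Callable[[str], Dict], scroll_id: str) -> Iterator[Dict]:
    """Yield every hit, following the scroll until a page comes back empty.

    fetch_page is a hypothetical callable (not part of any library) that
    requests the next scroll page for the given id and returns parsed JSON.
    """
    while True:
        page = fetch_page(scroll_id)
        hits = page["hits"]["hits"]
        if not hits:
            return
        # Each response carries a scroll id; use the latest one next time.
        scroll_id = page.get("_scroll_id", scroll_id)
        yield from hits


def bulk_bodies(hits: Iterable[Dict], target_index: str,
                batch_size: int = 500) -> Iterator[str]:
    """Group hits into newline-delimited request bodies for POST /_bulk."""
    lines: List[str] = []
    count = 0
    for hit in hits:
        # Action line: same type and id as the hit, but a different index.
        action = {"index": {"_index": target_index,
                            "_type": hit["_type"],
                            "_id": hit["_id"]}}
        lines.append(json.dumps(action))
        # Source line: the original document body, unchanged.
        lines.append(json.dumps(hit["_source"]))
        count += 1
        if count >= batch_size:
            yield "\n".join(lines) + "\n"
            lines, count = [], 0
    if lines:
        # Flush the last partial batch; the body must end with a newline.
        yield "\n".join(lines) + "\n"
```

Each string yielded by bulk_bodies would then be sent as the body of a POST to /_bulk with whatever HTTP client you prefer. The _type field is carried over because the hits in this thread include it.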

There are tools out there to do this for you, including the reindex helper in elasticsearch-py.

