How to fetch a large amount of data in parallel from Elasticsearch?
Using the Scroll API I am able to fetch the complete data set from Elasticsearch, but it is too slow: it takes around 40 seconds to fetch 500,000 records.
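For reference, the sequential scroll loop I am currently using looks roughly like this (a minimal sketch; `index_name`, `query_body` and `headers` are placeholders standing in for my real values):

    import json
    import requests

    index_name = 'logstash-2020.07.07'   # example index
    headers = {'Content-Type': 'application/json'}
    query_body = {'size': 10000, 'query': {'match_all': {}}}  # example query

    # Initial search request; opens a scroll context kept alive for 20 minutes.
    resp = requests.post(
        'http://0.0.0.0/' + index_name + '/_search?scroll=20m',
        data=json.dumps(query_body),
        headers=headers,
    ).json()
    scroll_id = resp['_scroll_id']
    hits = resp['hits']['hits']

    # Keep asking the scroll endpoint for the next page until it comes back empty.
    all_hits = []
    while hits:
        all_hits.extend(hits)  # collect this page
        resp = requests.post(
            'http://0.0.0.0/_search/scroll',
            data=json.dumps({'scroll': '20m', 'scroll_id': scroll_id}),
            headers=headers,
        ).json()
        scroll_id = resp['_scroll_id']
        hits = resp['hits']['hits']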
To speed this up, I am trying to use Python's multiprocessing module to fetch data from the scroll API with several processes simultaneously.
Here are the details of the code:
    import json
    import multiprocessing
    import requests

    # index_name, query_body and headers are the same as in the sequential
    # version above.
    elastic_url = 'http://0.0.0.0/' + index_name + '/_search?scroll=20m'
    scroll_api_url = 'http://0.0.0.0/_search/scroll'

    # Initial search request; this opens the scroll context.
    r1 = requests.request(
        "POST",
        elastic_url,
        data=json.dumps(query_body),
        headers=headers,
    )

    # Payload for the subsequent scroll calls, built from the first response.
    scroll_payload = {'scroll': '20m', 'scroll_id': r1.json()['_scroll_id']}

    # Create 5 processes that all call fetch_data_scroll_id, and start them.
    process = []
    for _ in range(5):
        p1 = multiprocessing.Process(target=fetch_data_scroll_id,
                                     args=(scroll_payload,))
        p1.start()
        process.append(p1)

    # join() so the main process waits for any workers that finish early.
    for p in process:
        p.join()
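`fetch_data_scroll_id` is essentially the following (a simplified sketch of my worker; each process keeps calling the scroll endpoint with the same payload until a page comes back empty):

    def fetch_data_scroll_id(scroll_payload):
        # Repeatedly post the scroll_id to the scroll API; stop when a page
        # of hits comes back empty.
        while True:
            resp = requests.post(
                scroll_api_url,
                data=json.dumps(scroll_payload),
                headers=headers,
            ).json()
            hits = resp['hits']['hits']
            if not hits:
                break
            # ... process this page of hits ...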
When the processes run, I get the following error response:
    {
      "_scroll_id": "DnF1ZXJ5VGhlbkXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXc3UXJHUy1PNUdvQ2lyencAAAAAACCS6BY0bXdKN1lHN1FyR1MtTzVHb0Npcnp3AAAAAAAgkuYWNG13SjdZRzdRckdTLU81R29DaXJ6dwAAAAAAIJLpFjRtd0o3WUc3UXJHUy1PNUdvQ2lyencAAAAAACCS5xY0bXdKN1lHN1FyR1MtTzVHb0Npcnp3",
      "_shards": {
        "failures": [
          {
            "index": null,
            "shard": -1,
            "reason": {
              "type": "search_context_missing_exception",
              "reason": "No search context found for id [2134760]"
            }
          },
          {
            "index": "logstash-2020.07.07",
            "reason": {
              "type": "search_context_missing_exception",
              "reason": "No search context found for id [2134757]"
            },
            "shard": 0,
            "node": "XXXXXXXXXXXXXXXXXXX"
          }
        ],
        "total": 5,
        "successful": 3,
        "skipped": 0,
        "failed": 2
      },
      "timed_out": false,
      "took": 17,
      "hits": {
        "total": 106157,
        "max_score": null,
        "hits": []
      },
      "terminated_early": true
    }
Why am I getting `search_context_missing_exception`, and how can I fetch the data in parallel correctly? Please help.