Scroll API taking a long time and using a lot of memory

Hi,

I have a single machine with the following configuration:
RAM: 32 GB (12 GB allocated to ES)
Disk: HDD
I have indexed 17 million rows x 3000 columns (200 GB) of data with 5 shards and no replicas.
I want to retrieve a subset of the data, 1 million rows x 3000 columns (10 GB), and store it in a CSV file.
I have tried various approaches, and the export takes 9 hours to complete.
I came across the scroll API, but it uses a lot of memory and my process eventually slows down.
I am using the following query:

result_dict = es.search(index="genes", doc_type="test", scroll='1m', size=5000, body={
    "query": {
        "terms": {
            "Cadd_GeneName.keyword": arr
        }
    },
    "sort": "_doc"
})
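
For context, here is a minimal sketch of how the rest of the scroll loop could be written so that only one page of hits is held in memory at a time. It is not the code from my actual run. The function name `scroll_to_csv`, its parameters, and the smaller default page size are hypothetical choices for illustration. It assumes `es` is an elasticsearch-py client exposing `search`, `scroll`, and `clear_scroll`, that the documents are flat (no nested fields), and that `fields` lists the columns to export. On older clusters, `doc_type` could be passed to `search` as in the query above.

```python
import csv


def scroll_to_csv(es, index, query, fields, out_file, scroll="2m", size=1000):
    """Stream every hit matching `query` to CSV, one scroll page at a time.

    `es` is assumed to be an elasticsearch-py client (or any object with
    compatible search/scroll/clear_scroll methods). Returns the row count.
    """
    writer = csv.writer(out_file)
    writer.writerow(fields)  # header row
    rows_written = 0

    # The first request opens the scroll context and returns the first page.
    page = es.search(
        index=index,
        scroll=scroll,
        size=size,
        # "_source" limits the returned fields; "_doc" order skips scoring.
        body={"query": query, "sort": ["_doc"], "_source": fields},
    )
    scroll_id = page.get("_scroll_id")
    try:
        while page["hits"]["hits"]:
            # Write each hit immediately instead of collecting hits in a list.
            for hit in page["hits"]["hits"]:
                source = hit["_source"]
                writer.writerow([source.get(name, "") for name in fields])
                rows_written += 1
            # Fetch the next page; the scroll id may change between calls.
            page = es.scroll(scroll_id=scroll_id, scroll=scroll)
            scroll_id = page.get("_scroll_id", scroll_id)
    finally:
        # Release the server-side scroll context, even if writing failed.
        if scroll_id:
            es.clear_scroll(scroll_id=scroll_id)
    return rows_written
```

With a real client this could be called as `scroll_to_csv(es, "genes", {"terms": {"Cadd_GeneName.keyword": arr}}, fields, f)`, where `f` is a file opened with `newline=""`. Whether this changes the memory behaviour depends on where the memory is actually going. It only rules out accumulating all pages on the client side.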

Any help would be greatly appreciated.

Thanks,
-Raj
