Hi,
I have a single machine with the following configuration:
RAM: 32 GB (12 GB allocated to the ES heap)
Storage: HDD (spinning disk)
I have indexed 17 million rows x 3,000 columns (about 200 GB) with 5 shards and no replicas.
I want to retrieve a subset of roughly 1 million rows x 3,000 columns (about 10 GB) and write it to a CSV file.
I have tried various approaches, and the export takes about 9 hours to complete.
I came across the scroll API, but it uses a lot of memory and my process eventually slows down.
I am using the following query:
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Open a scroll context and fetch the first batch of 5,000 hits
result_dict = es.search(
    index="genes", doc_type="test", scroll='1m', size=5000,
    body={
        "query": {
            "terms": {"Cadd_GeneName.keyword": arr}  # arr = list of gene names to export
        },
        "sort": "_doc"
    })
Any help would be greatly appreciated.
Thanks,
-Raj