I am currently running into an issue on which I am really stuck.
I am working on a problem where I have to export Elasticsearch documents and write them to CSV. The document counts range from 50,000 to 5 million.
I am experiencing serious performance issues, and I get the feeling that I am missing something here.
Right now I have a dataset of 400,000 documents that I am trying to scan and scroll over, which would ultimately be formatted and written to CSV. But the time taken just to iterate over the output is 20 minutes! That is insane.
Here is my script:
import time

import elasticsearch
import elasticsearch.helpers as helpers

es = elasticsearch.Elasticsearch(['http://XX.XXX.XX.XXX:9200'], retry_on_timeout=True)

# Scan/scroll over the whole index, fetching results in batches of 1000
scanResp = helpers.scan(client=es, scroll="50m", index='MyDoc', doc_type='MyDoc', timeout="50m", size=1000)

start_time = time.time()
for resp in scanResp:
    data = resp
print("--- %s seconds ---" % (time.time() - start_time))
Elasticsearch is hosted on an AWS m3.medium instance.
Can anyone please tell me what I might be doing wrong here?