Elasticsearch Bulk Write is slow using Scan and Scroll

(Amit Pandita) #1

Hi Group,

I have run into an issue that I am really stuck on.
I have to export Elasticsearch documents and write them to CSV. The number of docs ranges from 50,000 to 5 million.
I am experiencing serious performance issues and I get the feeling that I am missing something here.

Right now I have a dataset of 400,000 documents that I am trying to scan and scroll through, and which will ultimately be formatted and written to CSV. But just outputting the documents takes 20 minutes! That is insane.

Here is my script:

import elasticsearch
import elasticsearch.exceptions
import elasticsearch.helpers as helpers
import time

es = elasticsearch.Elasticsearch(['http://XX.XXX.XX.XXX:9200'], retry_on_timeout=True)

# helpers.scan returns a generator that scrolls through every matching document
scanResp = helpers.scan(client=es, scroll="50m", index='MyDoc', doc_type='MyDoc', timeout="50m", size=1000)

start_time = time.time()
for resp in scanResp:
    data = resp
    # Python 2 syntax: print the fourth value of each hit dict
    print data.values()[3]

print("--- %s seconds ---" % (time.time() - start_time))

I am using a hosted AWS m3.medium server for Elasticsearch.

Can anyone please tell me what I might be doing wrong here?
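
For the CSV step described above, here is a minimal sketch (Python 3) of streaming hits straight into a csv.DictWriter instead of printing each one. The function name write_hits_to_csv, the column names and the sample hits are all hypothetical; the only assumption about the data is that each item has a "_source" dict, which is the shape helpers.scan yields.

```python
import csv
import io


def write_hits_to_csv(hits, fieldnames, out):
    """Write the _source of each hit to `out` as CSV and return the row count.

    `hits` is any iterable of dicts that carry a "_source" dict.
    `fieldnames` is a hypothetical list of columns to export.
    """
    # extrasaction="ignore" drops _source keys that are not in fieldnames
    writer = csv.DictWriter(out, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for hit in hits:
        # Fields missing from a document are written as empty cells
        writer.writerow(hit.get("_source", {}))
        count += 1
    return count


# Stand-in data; with a live cluster the helpers.scan(...) generator
# would be passed in place of this list.
sample_hits = [
    {"_id": "1", "_source": {"name": "a", "value": 1}},
    {"_id": "2", "_source": {"name": "b", "value": 2, "extra": "x"}},
]
buffer = io.StringIO()
rows_written = write_hits_to_csv(sample_hits, ["name", "value"], buffer)
```

Because the function only iterates, it never holds more than one scroll batch in memory, and timing it against an in-memory list like the one above may help separate output cost from Elasticsearch fetch cost.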

(Mark Walkom) #2

The size parameter is what it gets from each shard, so if you have (e.g.) 5 shards, that's 2 million docs!
I'd start by reducing that to something considerably smaller and see if it helps.
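
To make the per-shard arithmetic concrete, here is a small sketch that takes the per-shard behaviour described above at face value. The shard count of 5 is only the example figure from this reply, not a known property of the cluster, and both function names are hypothetical.

```python
def docs_per_scroll_batch(size, shard_count):
    """Documents returned by one scroll request when size applies per shard."""
    return size * shard_count


def scroll_round_trips(total_docs, size, shard_count):
    """Number of scroll requests needed to read total_docs documents."""
    batch = docs_per_scroll_batch(size, shard_count)
    return -(-total_docs // batch)  # ceiling division


# size=1000 and 400,000 documents come from the original post;
# 5 shards is an assumed example value.
batch_size = docs_per_scroll_batch(1000, 5)
round_trips = scroll_round_trips(400000, 1000, 5)
```

Under those assumptions each scroll request carries 5,000 documents and the whole dataset takes 80 requests.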

(Amit Pandita) #3

@warkolm Yes, I did that already. In fact, I started with a size of 10, then tried 50, 100, 150, 200, 300, 500, 100 ... The best result was at 200, where I got results in 18 seconds, and that was for just 4,000 documents. That is a really bad figure. Apart from the size, what else do you think I might be missing?

(Mark Walkom) #4

Are you monitoring statistics on the cluster?
What do they tell you?
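
One way to look at those statistics from Python is to pull the node stats and reduce them to a few numbers. The sketch below is an assumption about what is worth checking first (JVM heap pressure and cumulative search query/fetch time), not a complete monitoring setup; summarize_node_stats and the sample values are hypothetical, and the input is expected to have the shape of the nodes stats API response.

```python
def summarize_node_stats(stats):
    """Reduce a nodes-stats response dict to heap and search timings per node."""
    summary = {}
    for node_id, node in stats.get("nodes", {}).items():
        search = node.get("indices", {}).get("search", {})
        heap = node.get("jvm", {}).get("mem", {}).get("heap_used_percent")
        # Key by node name when present, otherwise fall back to the node id
        summary[node.get("name", node_id)] = {
            "heap_used_percent": heap,
            "query_time_ms": search.get("query_time_in_millis"),
            "fetch_time_ms": search.get("fetch_time_in_millis"),
        }
    return summary


# Made-up sample in the shape of a nodes-stats response. With a live
# client the input would come from: stats = es.nodes.stats()
sample_stats = {
    "nodes": {
        "abc123": {
            "name": "node-1",
            "jvm": {"mem": {"heap_used_percent": 87}},
            "indices": {
                "search": {
                    "query_time_in_millis": 1200,
                    "fetch_time_in_millis": 9500,
                }
            },
        }
    }
}
node_summary = summarize_node_stats(sample_stats)
```

A heap percentage that stays high, or fetch time that dwarfs query time, would be a hint that the node itself is the limit rather than the client script.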
