How to improve performance of scan method?

(Samvid Kulkarni) #1

I am trying to get data from Elasticsearch using the elasticsearch-dsl Python library. I need to get all the data for the last 15 minutes. The issue is that retrieving the data is extremely slow: it takes a long time for 2.2 million hits. Here is my code:

import time

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

# Lists that collect one value per hit
source_ip, destination_ip = [], []
source_port, destination_port = [], []

start_time = time.time()
try:
	client = Elasticsearch(['IP_HERE'])
	# Match every document from the last 15 minutes
	s = Search(using=client, index="firewallv2-*", doc_type='doc').filter('range', **{'@timestamp': {'gte': 'now-15m', 'lt': 'now'}})
	response = s.execute()
except Exception as e:
	print("error in getting data from FIREWALL")

try:
	# scan() scrolls through all matching hits
	for hit1 in s.scan():
		source_ip.append(hit1.to_dict().get('Source IP'))
		destination_ip.append(hit1.to_dict().get('Destination IP'))
		destination_port.append(hit1.to_dict().get('Destination Port'))
		source_port.append(hit1.to_dict().get('Source Port'))
except Exception as e:
	print("not able to parse json data")

elapsed_time = time.time() - start_time
print("Time to get data from server " + str(elapsed_time))

There is more to the code, but I am posting only the main slow component. The rest is pure Python code. Below is my output:

Time to get data from server 893.599892855
Time to store data into variables 27.647258997
Time to process for loop 9.32531404495

All times are in seconds, and you can see that it takes a huge amount of time to retrieve 2.2 million hits.

I also tried using bulk_size=10000 and changing bulk_size to various other values, but with no success.

(Nik Everett) #2

There are a few things you can do, like setting the size in the search higher. The default search size is meant more for regular search than for scrolling. Usually you're better off trying to work these sorts of things into aggregations if you can. The documents are stored on disk in such a way that it is faster to run aggregations than it is to return the entire _source.
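
As a rough illustration of the aggregation route, the sketch below builds a request body that counts hits per destination port over the same 15-minute window instead of returning every document. It assumes the `Destination Port` field is mapped so that a terms aggregation can run on it (for example as a keyword or numeric field). The aggregation name `by_destination_port`, the helper name `build_port_count_body`, and the bucket count of 100 are made up for the example.

```python
def build_port_count_body(minutes=15, top_n=100):
    """Build a search body that aggregates instead of returning documents."""
    return {
        # size 0: return no hits, only the aggregation results
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {"range": {"@timestamp": {"gte": f"now-{minutes}m", "lt": "now"}}}
                ]
            }
        },
        "aggs": {
            # Hypothetical aggregation name; assumes the field supports terms aggregations
            "by_destination_port": {
                "terms": {"field": "Destination Port", "size": top_n}
            }
        },
    }


body = build_port_count_body()
```

Depending on the client version, a body like this would typically be sent with something like `client.search(index="firewallv2-*", body=body)`, with the buckets read from the `aggregations` part of the response. Whether an aggregation can replace the full export depends on what the rest of the code does with the four lists.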

(Samvid Kulkarni) #3

Thank you very much for replying, and sorry for the late response. I have made the changes you suggested and was able to lower the time taken to 710 seconds, but I am not able to lower it further.

We have only a single-node setup, so could that be causing the problem? Or is there any other issue you can think of?
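
One detail visible in the loop from post #1 is that it calls `hit1.to_dict()` four times per hit, which adds up over 2.2 million hits. The sketch below converts each hit once and reads all four fields from the resulting dict. The helper name `collect_fields` and the `FakeHit` class are hypothetical and exist only so the example runs without a cluster. This change alone may not account for much of the 710 seconds, since most of the time may be spent transferring documents.

```python
FIELDS = ("Source IP", "Destination IP", "Source Port", "Destination Port")


def collect_fields(hits, fields=FIELDS):
    """Collect the given fields from each hit into one list per field."""
    columns = {name: [] for name in fields}
    for hit in hits:
        doc = hit.to_dict()  # convert once per hit, not once per field
        for name in fields:
            columns[name].append(doc.get(name))
    return columns


class FakeHit:
    """Hypothetical stand-in for an elasticsearch-dsl hit, used only for this demo."""

    def __init__(self, data):
        self._data = data

    def to_dict(self):
        return dict(self._data)


demo = collect_fields([FakeHit({"Source IP": "10.0.0.1", "Destination Port": 443})])
```

In the original code, the equivalent call would be `collect_fields(s.scan())`.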