Help needed with proper usage of the scroll API using Python: getting the same results


(Juggernaut Panda) #1

Hello all.

I have to process a query that returns more than 10,000 hits, so the natural choice was the scroll API.

I have read the Elastic documentation and done the following:

import requests

resp=requests.post('http://localhost:9200/netflow*/_search?pretty=true&size=100&scroll=5m')

This gave me a scroll ID.

I stored that scroll ID in a variable and did the following:

SearchExp="http://localhost:9200/_search/scroll?pretty=true&scroll=5m&scroll_id="+ScrollID
response = requests.post(SearchExp)

However, every time I run the program, I get the same 100 results (since size=100).

What should I do to get the next set of results and read everything beyond the first 10,000 hits?


(David Pilato) #2

I never tried to pass those parameters as query params. I'm unsure if it's supposed to work.

The documentation says:

POST  /_search/scroll 
{
    "scroll" : "1m", 
    "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==" 
}

Could you try it that way instead?

If it does not work please share all details (responses and requests included).
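
For reference, here is a minimal Python sketch of the documented form above, with the scroll ID in the JSON body rather than in the query string. The helper name build_scroll_request and its base_url and keep_alive parameters are made up for illustration; only the /_search/scroll endpoint and the "scroll" and "scroll_id" fields come from the documentation.

```python
import json


def build_scroll_request(scroll_id, keep_alive="1m", base_url="http://localhost:9200"):
    """Return (url, headers, body) for a scroll continuation request.

    Hypothetical helper: the name and defaults are assumptions, not part
    of any Elasticsearch client library.
    """
    url = base_url + "/_search/scroll"
    headers = {"Content-Type": "application/json"}
    # json.dumps takes care of quoting, so the scroll ID can be arbitrarily long.
    body = json.dumps({"scroll": keep_alive, "scroll_id": scroll_id})
    return url, headers, body
```

The three values can then be passed to requests.post(url, headers=headers, data=body).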


(Juggernaut Panda) #3

Hello. It does work this way, but I have to send the parameters in the URL.

The reason is that the generated scroll_id is so long that it does not fit inside an HTTP POST request.


(David Pilato) #4

Great. So the documentation says it all, IMO.
I believe passing the scroll ID as a query parameter has been removed or was never supported. I did not check the code.
Maybe the parameter name is a bit different? Like _scroll_id instead?

The reason is that the generated scroll_id is so long that it does not fit inside an HTTP POST request.

I would expect the opposite: a POST body has no length limit (or at least a very high one), whereas the URL length certainly has a lower limit.


(Juggernaut Panda) #5

Yes. I was confused.

However, I fixed my issue as follows:

import json
import sys

import requests


def main(args):
    # Open the scroll context. Note that this first response already
    # contains the first page of hits.
    resp = requests.post('http://localhost:9200/netflow-2018.02.20/_search?pretty=true&size=100&scroll=5m')
    resp = json.loads(resp.content)
    sid = resp['_scroll_id']
    print(sid)

    headers = {'Content-Type': 'application/json'}
    while True:  # continue this loop until hits become zero
        # Send the scroll ID in the JSON body, as the documentation shows.
        data = json.dumps({'scroll': '1m', 'scroll_id': sid})
        response = requests.post('http://localhost:9200/_search/scroll', headers=headers, data=data)
        response = json.loads(response.content)
        if not response['hits']['hits']:
            break
        # Use the scroll ID from the latest response for the next request.
        sid = response['_scroll_id']
    return 0


if __name__ == '__main__':
    sys.exit(main(sys.argv))
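
The same loop can also be written as a generator, so the caller just iterates over hits. This is only a sketch under a few assumptions: the names scroll_hits and requests_post are made up, and the HTTP call is passed in as a function so the paging logic can be tried without a running cluster.

```python
def scroll_hits(post, base_url, index, page_size=100, keep_alive="1m"):
    """Yield every hit of a match-all search on `index`, page by page.

    `post(url, body)` is a caller-supplied function (an assumption of this
    sketch) that sends `body` as JSON and returns the parsed JSON response.
    """
    # The initial search opens the scroll context and returns the first page.
    page = post(f"{base_url}/{index}/_search?scroll={keep_alive}", {"size": page_size})
    while page["hits"]["hits"]:
        yield from page["hits"]["hits"]
        # Always continue with the scroll ID from the most recent response.
        page = post(
            f"{base_url}/_search/scroll",
            {"scroll": keep_alive, "scroll_id": page["_scroll_id"]},
        )


def requests_post(url, body):
    """One possible `post` implementation, based on the requests library."""
    import requests  # imported lazily so scroll_hits itself has no dependency

    return requests.post(url, json=body).json()
```

With a cluster running locally, usage would look like: for hit in scroll_hits(requests_post, "http://localhost:9200", "netflow-2018.02.20"): print(hit["_id"])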

