Get all documents from an index

(Andreas Wachter) #1

Is it possible to get all the documents from an index?
I tried it with Python and requests, but I always get this error:
query_phase_execution_exception","reason":"Result window is too large, from + size must be less than or equal to: [10000] but was [11000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.

I have no idea how the scroll API works, and the documentation isn't helping me either.
Could someone please help me?

(Heinmci) #2

You said you tried with Python, so I'll just show you what worked for me:

from elasticsearch import Elasticsearch

es = Elasticsearch(['http://yourElasticIP:9200/'])
doc = {
    'size': 10000,
    'query': {
        'match_all': {}
    }
}
res = es.search(index='indexname', doc_type='typename', body=doc, scroll='1m')

Then you get a response with your matching documents and also an attribute named '_scroll_id'.

So you can do

scrollId = res['_scroll_id']
es.scroll(scroll_id = scrollId, scroll = '1m')

Where res is the result of your previous es search.
You can call es.scroll as many times as you need. Just remember to update the scrollId value from the latest response each time you make a new request.
Sorry if I wasn't very clear
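
The repeated es.scroll calls described above can be wrapped in a loop. Here is a minimal sketch, assuming an elasticsearch-py client of the same generation as the one in this thread (one whose search accepts body= and scroll=). The function name iter_all_hits and its parameters are made up for illustration and are not part of the library.

```python
def iter_all_hits(es, index, body, scroll='1m'):
    """Yield every hit matching `body`, one scroll page at a time.

    `es` is assumed to be an elasticsearch-py client object; the helper
    name and signature are illustrative, not part of the library.
    """
    # The first request opens the scroll context and returns the first page.
    res = es.search(index=index, body=body, scroll=scroll)
    # An empty page means there is nothing left to scroll through.
    while res['hits']['hits']:
        for hit in res['hits']['hits']:
            yield hit
        # Always use the scroll id from the most recent response.
        res = es.scroll(scroll_id=res['_scroll_id'], scroll=scroll)
```

With the names from the post above, this would be used as: for hit in iter_all_hits(es, 'indexname', doc): ...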

(Andreas Wachter) #3

Thank you.
So size indicates how many hits I get?

(Heinmci) #4

Yes, but as you saw, it can't be over 10 000, so you have to use the scroll API. I don't think you have another choice.
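
For completeness, the error message in the first post also names the index.max_result_window setting. Below is a rough sketch of raising it from Python, assuming the client's indices.put_settings call accepts index= and body= as it does in client versions of that era. The helper name and the example value are illustrative only, and the error message itself recommends scroll as the more efficient route for large result sets.

```python
def raise_result_window(es, index, max_window):
    """Set index.max_result_window on one index (hypothetical helper)."""
    settings = {'index': {'max_result_window': max_window}}
    # indices.put_settings updates index-level settings on an existing index.
    return es.indices.put_settings(index=index, body=settings)
```

For example: raise_result_window(es, 'indexname', 20000)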

(Andreas Wachter) #5

Ok, so I will get the first 10 000 results. How do I get the rest?
Sorry for the silly question, but I can't see the forest for the trees right now.

(Heinmci) #6

If you look at the code above, the es.scroll function allows you to get results past 10 000.

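from elasticsearch import Elasticsearch

es = Elasticsearch(['http://x.x.x.x:9200/'])
doc = {
    'size': 10000,
    'query': {
        'match_all': {}
    }
}

# First page: this request opens the scroll context
res = es.search(index="myIndex", doc_type='myType', body=doc, scroll='1m')
scroll = res['_scroll_id']
# Next page: pass the scroll id from the previous response
res2 = es.scroll(scroll_id=scroll, scroll='1m')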

In this example, you have your first 10 000 hits in res and the next 10 000 in res2. If you want results from 20 000 to 30 000, take the new scroll id from res2 (res2['_scroll_id']) and pass it to another es.scroll call!
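
To make the "20 000 to 30 000" case concrete, here is a small sketch that walks forward to a given page. It assumes the same client calls used in this thread (es.search with scroll=, then es.scroll); the helper name fetch_page and the 0-based page parameter are invented for illustration. Scrolling only moves forward, so the earlier pages are fetched and thrown away on the way.

```python
def fetch_page(es, index, body, page, scroll='1m'):
    """Return the hits of scroll page `page` (0-based).

    Earlier pages are fetched and discarded, because a scroll can only
    advance. Helper name and signature are illustrative.
    """
    res = es.search(index=index, body=body, scroll=scroll)  # page 0
    for _ in range(page):
        # Each call advances one page; reuse the latest scroll id.
        res = es.scroll(scroll_id=res['_scroll_id'], scroll=scroll)
    return res['hits']['hits']
```

With 'size' set to 10000 in the body, fetch_page(es, 'myIndex', doc, 2) would correspond to hits 20 000 to 30 000.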

(Andreas Wachter) #7

AHHHH... forest, there it is.
Thank you for your help. It finally clicked.

(Heinmci) #8

No problem, have a nice day!

(Andreas Wachter) #9

On one of my indexes I get no data, just:
{'timed_out': False, 'hits': {'total': 1843, 'max_score': 1.0, 'hits': []}, '_shards': {'successful': 5, 'total': 5, 'failed': 0}, 'terminated_early': False, '_scroll_id': 'DnF1ZXJ5VGhlbkZldGNoBQAAAAAAARfQFm5rMUVCeUxTVDJHUm5qZ2dBQkpJMncAAAAAAAExGBZyNFIxMV93QVRqT0wtTTNoZ1dUenN3AAAAAAABF88WbmsxRUJ5TFNUMkdSbmpnZ0FCSkkydwAAAAAAAPrrFnpFTW9aaHRPUzd1X0Y0UHRORTFpSFEAAAAAAAExFxZyNFIxMV93QVRqT0wtTTNoZ1dUenN3', 'took': 2}
Any idea on that?

(Heinmci) #10

Sorry, not sure why that is.
The only thing that comes to mind is that size was set to 0. Other than that, I don't know.
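
To narrow down a response like the one in post #9, it can help to compare the reported total with the number of hits actually returned. The sketch below does only that; the helper name summarize_page is made up. A result of (1843, 0) would be consistent with size being 0, as suggested above, and possibly also with an es.scroll call made after earlier pages had already returned every hit, though the thread does not confirm either cause.

```python
def summarize_page(res):
    """Return (total matches, hits in this page) for a search or scroll response."""
    total = res['hits']['total']
    # Newer Elasticsearch versions report total as {'value': ..., 'relation': ...}.
    if isinstance(total, dict):
        total = total['value']
    return total, len(res['hits']['hits'])
```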
