Fastest way to retrieve all ids in an index?

akshaymaniyar · May 14, 2019, 8:46am

I have an index with around 300 million documents. What is the fastest recommended way to retrieve all the document IDs from the index?

Currently I am using the Python script below to scan and scroll through the index and retrieve all the IDs. However, this takes around 20-24 hours to fetch all of them:

import json
import sys
import requests

indexName = sys.argv[1]
url = "http://<<es_host>>/" + str(indexName) + "/_search"

querystring = {"scroll":"1m"}

payload = {
            "query": {"match_all": {}},
            "size": 1000,
            "stored_fields": []
}


headers = {
    "Content-Type": "application/json"
}

# First request opens the scroll context and returns the first page
response = requests.post(url, data=json.dumps(payload), headers=headers, params=querystring)
response.raise_for_status()
page = response.json()

# Every later page comes from the _search/scroll endpoint
scroll_url = "http://<<es>>/_search/scroll"

# The with-block closes (and flushes) the file even if a request fails
with open('results.txt', 'w') as f:
    while True:
        hits = page['hits']['hits']
        if len(hits) == 0:
            break
        for hit in hits:
            f.write(hit['_id'])
            f.write("\n")

        # Ask for the next page and keep the scroll context alive for another minute
        payload = {
            "scroll_id": page['_scroll_id'],
            "scroll": "1m"
        }
        response = requests.post(scroll_url, data=json.dumps(payload), headers=headers)
        response.raise_for_status()
        # Parse the body, otherwise the next iteration cannot index into it
        page = response.json()

thn · May 14, 2019, 10:54am

I don't know what's in your data, but there are a few options you could think about:

Do a scan and scroll on ranges of the document's timestamp, so you can have multiple processes running in parallel, each on a different date/time range.
Do a scan and scroll partitioned on the document's "type/category/topic/etc."
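
As a rough sketch of the first option: the helper below splits a time span into equal windows and builds one search body per window, so each body can be scrolled by a separate process. The field name `timestamp`, the function name `time_range_queries`, and the example dates are assumptions for illustration only; substitute whatever date field and bounds the index really has.

```python
from datetime import datetime

def time_range_queries(start, end, parts, field="timestamp"):
    """Split [start, end) into `parts` equal windows, one search body each.

    `field` is a hypothetical date field name; it is not taken from the thread.
    """
    step = (end - start) / parts
    bodies = []
    for i in range(parts):
        lo = start + i * step
        # Use the exact end for the last window so rounding never leaves a gap
        hi = end if i == parts - 1 else start + (i + 1) * step
        bodies.append({
            # gte/lt keeps the windows contiguous and non-overlapping
            "query": {"range": {field: {"gte": lo.isoformat(), "lt": hi.isoformat()}}},
            "size": 1000,
            "_source": False,   # only the _id of each hit is needed
            "sort": ["_doc"],
        })
    return bodies

# Example bounds only; the real span has to cover every document
bodies = time_range_queries(datetime(2019, 1, 1), datetime(2019, 1, 5), parts=4)
for body in bodies:
    print(body["query"]["range"]["timestamp"])
```

Each body would then go through the same scroll loop as the script in the question. Note that `end` is exclusive, and documents that fall outside the span or lack the field are not returned by any window, so this only works if the bounds are known.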

dadoonet · May 14, 2019, 11:06am

I think you could be a little faster if you sort on _doc (index order), as the documentation says.
Also have a look at sliced scroll, which lets you split one scroll into several slices that can be consumed in parallel: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html#sliced-scroll
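
A minimal sketch of how the two suggestions might be combined: one request body per slice, each sorted on `_doc`, plus a generator that drains a single slice. The function names `sliced_scroll_bodies` and `iter_ids` and the injected `post` callable are made up for this example; `post(url, body, params)` is assumed to send the request and return the parsed JSON response.

```python
def sliced_scroll_bodies(max_slices, page_size=1000):
    """Return one initial search body per slice; together they cover the index."""
    if max_slices < 2:
        raise ValueError("sliced scroll needs at least 2 slices")
    return [
        {
            "slice": {"id": slice_id, "max": max_slices},
            "query": {"match_all": {}},
            "sort": ["_doc"],    # index order, the cheapest order for scrolling
            "size": page_size,
            "_source": False,    # only the _id of each hit is needed
        }
        for slice_id in range(max_slices)
    ]

def iter_ids(post, search_url, scroll_url, body, keep_alive="1m"):
    """Yield every _id of one slice.

    `post` is a hypothetical helper supplied by the caller; it must return
    the response body as a dict.
    """
    page = post(search_url, body, {"scroll": keep_alive})
    while page["hits"]["hits"]:
        for hit in page["hits"]["hits"]:
            yield hit["_id"]
        # Fetch the next page using the scroll id from the page just consumed
        next_body = {"scroll_id": page["_scroll_id"], "scroll": keep_alive}
        page = post(scroll_url, next_body, None)

for body in sliced_scroll_bodies(4):
    print(body["slice"])
```

With the `requests` library, `post` could be something like `lambda url, body, params: requests.post(url, json=body, params=params).json()`. Running `iter_ids` for each slice in its own thread or process, each writing to its own file, is what lets the slices proceed in parallel.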

system · June 11, 2019, 11:18am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
