Getting next 10k documents with AppSearch.list_documents()

marc.schwarzschild · October 17, 2023, 1:40pm

Hi,
I'm trying to get IDs for all our 300k+ documents. The AppSearch API has list_documents which only lists 10k documents. Is there an argument I can use with it to get the next 10k documents?

I have searched and found that lots of other users are struggling with how to access all their documents. I even considered exporting/dumping all our data to my laptop and working outside elastic.co. It seems like our data is held hostage with no apparent way to do this.

PLEASE ADVISE!

BTW, I'm doing this in Python.

Thank you,
Marc

sholzhauer · October 17, 2023, 2:09pm

I am assuming you are referencing this documentation

In that case you should be able to use the page attribute, in something like the code below, to walk through all of your results.

import requests

url = "<your_es_endpoint>"
auth = ("<your_awesome_user>", "<your_extremely_secure_password>")

page = 1
results = []
while True:
    # Request one page at a time; App Search nests paging options under "page"
    resp = requests.get(
        url=url,
        auth=auth,
        json={"page": {"current": page, "size": 100}},
    ).json()
    results += resp["results"]
    # Stop once the last page reported by the API has been read
    if page >= resp["meta"]["page"]["total_pages"]:
        break
    page += 1

marc.schwarzschild · October 17, 2023, 2:30pm

Thank you for the quick reply. What would the get be if I have the AppSearch host/key rather than endpoint/user/pw? I think host=endpoint.

sholzhauer · October 17, 2023, 2:35pm

You mean the URL (hostname) and an API key?

I think this should work (not sure):

requests.get(
    url="<hostname>/api/as/v1/engines/<index>/documents/list",
    headers={
        "Authorization": "Bearer <apikey>"
    }
)

marc.schwarzschild · October 17, 2023, 2:51pm

Could this be done with the python Elasticsearch or AppSearch packages?

I successfully did a get() with a 200 return but it is just html with this comment at the end: "This Elastic installation has strict security requirements enabled that your current browser does not meet".

Once again, isn't there an easy way to get all our ids for 300k+ documents via Elasticsearch or AppSearch?

Thank you.
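
The symptom described above (a 200 status whose body is HTML rather than JSON) can be caught before any parsing is attempted. Here is a minimal sketch, assuming the status code, Content-Type header and body text are pulled from the response first; `parse_json_response` is a hypothetical helper name, not part of any Elastic package.

```python
import json


def parse_json_response(status: int, content_type: str, body: str) -> dict:
    """Return the decoded JSON body, or raise if the reply is not JSON.

    Hypothetical helper: a 200 with an HTML body usually means the request
    reached a web page instead of the API route.
    """
    if status != 200:
        raise RuntimeError(f"unexpected status {status}")
    if "application/json" not in content_type.lower():
        # Show the start of the body so the HTML page is easy to recognise
        raise RuntimeError(
            f"expected JSON but got {content_type!r}: {body[:60]!r}"
        )
    return json.loads(body)


# With requests, the three arguments would come from the response object:
#   r = requests.get(url, headers=headers)
#   data = parse_json_response(
#       r.status_code, r.headers.get("Content-Type", ""), r.text
#   )
ok = parse_json_response(200, "application/json; charset=utf-8", '{"results": []}')
print(ok)
```

If the check raises on an HTML body, the URL path or the auth header is the first thing to re-examine.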

sholzhauer · October 17, 2023, 5:14pm

You might need to add the json header:

requests.get(
  url="<stuff>",
  headers={
    "Authorization": "Bearer <apikey>",
    "Content-Type": "application/json"
  }
)

If that doesn't work: there probably is a package for this, but I am not familiar with the module(s) and tend to just use the API. The APIs are well built, and calling them directly removes a component.
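
For the package route, a rough sketch of collecting IDs through a client object is below. It assumes the client exposes `list_documents()` (the method named in the topic title) and that it accepts `engine_name`, `current_page` and `page_size` keyword arguments and returns the same `meta`/`results` structure as the REST API; those keyword names are an assumption and should be checked against the installed package version. `list_all_ids` and `FakeClient` are hypothetical names used only for illustration, and this thread does not confirm whether paging reaches past the 10k cap mentioned in the first post.

```python
def list_all_ids(client, engine_name, page_size=100):
    """Page through client.list_documents() and collect every document id.

    Assumption: the keyword names below match the installed client.
    """
    ids = []
    page = 1
    while True:
        resp = client.list_documents(
            engine_name=engine_name,
            current_page=page,
            page_size=page_size,
        )
        ids.extend(doc["id"] for doc in resp["results"])
        # Stop once the last page reported by the API has been read
        if page >= resp["meta"]["page"]["total_pages"]:
            break
        page += 1
    return ids


class FakeClient:
    """Stand-in for a real client so the loop can be run offline."""

    def __init__(self, total):
        self._docs = [{"id": f"doc-{i}"} for i in range(total)]

    def list_documents(self, engine_name, current_page, page_size):
        start = (current_page - 1) * page_size
        total_pages = -(-len(self._docs) // page_size)  # ceiling division
        return {
            "meta": {"page": {"current": current_page, "total_pages": total_pages}},
            "results": self._docs[start:start + page_size],
        }


print(list_all_ids(FakeClient(5), "my-engine", page_size=2))
```

Swapping `FakeClient` for a real client instance should be the only change needed, provided the assumptions above hold.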

system · November 14, 2023, 5:15pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
