Getting next 10k documents with AppSearch.list_documents()

Hi,
I'm trying to get IDs for all our 300k+ documents. The AppSearch API has list_documents which only lists 10k documents. Is there an argument I can use with it to get the next 10k documents?

I have searched and found that lots of other users are struggling with how to access all their documents. I even considered exporting/dumping all our data to my laptop and working outside elastic.co. It seems like our data is held hostage with no apparent way to do this.

PLEASE ADVISE!

BTW, I'm doing this in Python.

Thank you,
Marc

I am assuming you are referencing this documentation

In that case, you should be able to use the page attribute, in something like the snippet below, to get all of your results.

import requests

url = "<your_es_endpoint>"
auth = ("<your_awesome_user>", "<your_extremely_secure_password>")

page = 1
results = []
while True:
    # Request one page at a time; App Search expects paging under a "page" object.
    resp = requests.get(
        url=url,
        auth=auth,
        json={"page": {"current": page, "size": 100}},
    ).json()
    results += resp["results"]
    # Stop once the last page reported by the API has been read.
    if page >= resp["meta"]["page"]["total_pages"]:
        break
    page += 1
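
Since the goal is only the IDs, the paging loop can be separated from the HTTP call so it is easy to check offline. The following is a minimal sketch, not a confirmed fix for the 10k cap mentioned in the question: collect_ids and fake_fetch are hypothetical names, the response shape (meta.page.total_pages, results with an "id" field) is assumed from the snippet above, and fake_fetch stands in for a real request.

```python
from typing import Callable, Dict, List


def collect_ids(fetch_page: Callable[[int], Dict]) -> List[str]:
    """Walk every page returned by fetch_page and collect the document ids."""
    ids: List[str] = []
    page = 1
    while True:
        resp = fetch_page(page)
        ids.extend(doc["id"] for doc in resp["results"])
        # total_pages is assumed to be reported under meta.page.
        if page >= resp["meta"]["page"]["total_pages"]:
            break
        page += 1
    return ids


def fake_fetch(page: int, total: int = 250, size: int = 100) -> Dict:
    """Hypothetical stand-in for the HTTP call: serves `total` fake documents."""
    start = (page - 1) * size
    docs = [{"id": f"doc-{i}"} for i in range(start, min(start + size, total))]
    total_pages = -(-total // size)  # ceiling division
    return {
        "meta": {"page": {"current": page, "total_pages": total_pages}},
        "results": docs,
    }


all_ids = collect_ids(fake_fetch)
```

In real use, fetch_page would wrap the requests.get call and return its parsed JSON for the given page number.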

Thank you for the quick reply. What would the get be if I have the AppSearch host/key rather than endpoint/user/pw? I think host=endpoint.

You mean the URL (hostname) and an API key?

I think this should work (not sure)

requests.get(
    url="<hostname>/api/as/v1/engines/<index>/documents/list",
    headers={
        "Authorization": "Bearer <apikey>"
    }
)
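
To avoid typos when assembling that URL by hand, the pieces can be built in one place. This is only a sketch under the same assumptions as the reply above: build_list_request is a hypothetical helper, and the path and Bearer header are taken from the snippet, not verified against a live deployment.

```python
from typing import Dict, Tuple
from urllib.parse import quote


def build_list_request(host: str, engine: str, api_key: str) -> Tuple[str, Dict[str, str]]:
    """Return the (url, headers) pair for the documents list call."""
    # Strip a trailing slash from the host and escape the engine name.
    url = f"{host.rstrip('/')}/api/as/v1/engines/{quote(engine, safe='')}/documents/list"
    headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers


# Placeholder values for illustration only.
url, headers = build_list_request("https://example.test/", "my-engine", "private-abc")
```

The result can then be passed on as requests.get(url=url, headers=headers).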

Could this be done with the python Elasticsearch or AppSearch packages?

I successfully did a get() with a 200 return but it is just html with this comment at the end: "This Elastic installation has strict security requirements enabled that your current browser does not meet".

Once again, isn't there an easy way to get all our ids for 300k+ documents via Elasticsearch or AppSearch?

Thank you.

You might need to add the json header:

requests.get(
  url="<stuff>",
  headers={
    "Authorization": "Bearer <apikey>",
    "Content-Type": "application/json"
  }
)
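
A 200 status with an HTML body, as described above, suggests the request reached a web page and not the API, so it may help to check the response type before parsing it. A minimal sketch, assuming the API answers with a JSON content type; parse_json_response is a hypothetical helper, not part of requests or any Elastic package.

```python
import json
from typing import Any, Mapping


def parse_json_response(headers: Mapping[str, str], body: str) -> Any:
    """Parse body as JSON, or raise if the server sent something else (e.g. HTML)."""
    content_type = headers.get("Content-Type", "")
    if "application/json" not in content_type.lower():
        raise ValueError(
            f"Expected JSON but got {content_type!r}; body starts with {body[:80]!r}"
        )
    return json.loads(body)


# Example with a JSON response, using placeholder data.
parsed = parse_json_response(
    {"Content-Type": "application/json; charset=utf-8"}, '{"results": []}'
)
```

With requests this would be called as parse_json_response(resp.headers, resp.text), which fails with a readable message if the body is HTML.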

If that doesn't work: there probably is a package that can do this, but I am not familiar with the module(s) and tend to just use the API. The APIs are well built, and using them directly removes a component.
