Fetching all docs in an app search index


#1

Hi,
I'm trying to get the ids of all documents in my app search index. I tried simply iterating over searches with an empty query, incrementing the page number each time. This seemed to work fine for the first 10 requests.

This is what my request looks like:

{
"query": "",
"result_fields": {
"id": { "raw": {} }
},
"page": {"size": 1000,"current": [PAGENUMBER] }
}
Where [PAGENUMBER] is 1 for the first request, 2 for second and so on…

This is the result of the 10th request:
{
"meta": {
"warnings": [],
"page": {
"current": 10,
"total_pages": 92,
"total_results": 91035,
"size": 1000
},
"request_id": "39684c716fe14725a70406a1a71789e4"
},
"results": [...] <--- 1000 results here
}

Working as expected and showing all 91035 docs in the index and that there are a total of 92 pages.
But the result of the 11th request:
{
"meta": {
"warnings": [],
"page": {
"current": 11,
"total_pages": 0,
"total_results": 0,
"size": 1000
},
"request_id": "39684c716fe14725a70406a1a71789e4"
},
"results": [] <--- 0 results here
}

Suddenly it indicates that there are no docs found at all…

The documentation says that a search request should support a page size of up to 1000 and up to 500 pages, but it seems to support at most 10 pages when the page size is 1000. Or is there some setting I need to change to support more result pages?
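
The paging loop described above can be sketched roughly as follows. This is only an illustration under assumptions, not the official client: `search` is a hypothetical callable standing in for the HTTP call to the search endpoint, each result is assumed to look like `{"id": {"raw": ...}}`, and `fake_search` merely imitates the behaviour seen in the two responses above (nothing returned once the requested offset reaches 10,000).

```python
from typing import Callable, Dict, List


def build_body(page: int, size: int = 1000) -> Dict:
    """Request body from the post: empty query, only the id field."""
    return {
        "query": "",
        "result_fields": {"id": {"raw": {}}},
        "page": {"size": size, "current": page},
    }


def collect_ids(search: Callable[[Dict], Dict], size: int = 1000) -> Dict:
    """Page until a response comes back empty, then flag a short result set."""
    ids: List[str] = []
    expected = None
    page = 1
    while True:
        response = search(build_body(page, size))
        if expected is None:
            # Remember the total reported by the first page.
            expected = response["meta"]["page"]["total_results"]
        results = response["results"]
        if not results:
            break
        ids.extend(result["id"]["raw"] for result in results)
        page += 1
    return {"ids": ids, "expected": expected, "complete": len(ids) == expected}


def fake_search(body: Dict, total: int = 91035, window: int = 10000) -> Dict:
    """Stand-in server (assumption): returns nothing at or past `window`."""
    size = body["page"]["size"]
    current = body["page"]["current"]
    start = (current - 1) * size
    if start >= window:
        page_meta = {"current": current, "total_pages": 0,
                     "total_results": 0, "size": size}
        return {"meta": {"page": page_meta}, "results": []}
    stop = min(start + size, total)
    page_meta = {"current": current, "total_pages": -(-total // size),
                 "total_results": total, "size": size}
    results = [{"id": {"raw": f"doc-{i}"}} for i in range(start, stop)]
    return {"meta": {"page": page_meta}, "results": results}


outcome = collect_ids(fake_search)
print(len(outcome["ids"]), outcome["expected"], outcome["complete"])
```

Comparing the number of collected ids with the `total_results` value from the first page makes the truncation visible instead of silently ending the loop on an empty page.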

Or is there some other way I can request ids of all docs in the index?

Any help would be appreciated


#2

I have now found that there is a list method (/documents/list) that is specifically meant to get all docs in an index. But it returns all data (not just the id field) and has a max page size of 100 docs.

This means I'll have to spend at least 10 times as many API requests to get this done, which might push me above the monthly limit and result in extra licensing costs.
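
As a rough check of that request count, here is a small sketch using the 91035 documents reported earlier in the thread. The helper name `requests_needed` is made up for illustration.

```python
import math


def requests_needed(total_docs: int, page_size: int) -> int:
    """Number of paged requests needed to cover total_docs."""
    return math.ceil(total_docs / page_size)


search_requests = requests_needed(91035, 1000)  # search endpoint, 1000 per page
list_requests = requests_needed(91035, 100)     # documents/list, 100 per page

print(search_requests)  # 92
print(list_requests)    # 911
```

So the list endpoint would need roughly ten times as many requests for the same index.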

So I'd still be interested to know of any alternatives if they exist.


(Kellen Evan) #3

Frank!

You are correct. The most effective method of returning documents would be to iterate over the documents/list endpoint, like so:

curl -X GET 'https://host-xxxxxx.api.swiftype.com/api/as/v1/engines/example-engine/documents/list' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer private-xxxxxxxxxxxxxxxxxxxx' \
-d '{
  "page": {
    "current": 1,
    "size": 100
  }
}'

As you pointed out, this will return full documents, not just the id. You will need to parse out the id when assembling your list.
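
A minimal sketch of that iteration, under some assumptions: `fetch_page` is a hypothetical callable that would issue the request shown in the curl command above and return the decoded JSON, the response is assumed to carry `meta.page.total_pages` and a `results` list, and each document is assumed to expose its id as a top-level `id` field. The fake below exists only so the example runs without a network call.

```python
from typing import Callable, Dict, Iterator


def iter_document_ids(fetch_page: Callable[[int, int], Dict],
                      size: int = 100) -> Iterator[str]:
    """Walk documents/list page by page and yield only the ids."""
    page = 1
    while True:
        response = fetch_page(page, size)
        for document in response["results"]:
            # Full documents come back; keep just the id.
            yield document["id"]
        if page >= response["meta"]["page"]["total_pages"]:
            break
        page += 1


def fake_fetch_page(page: int, size: int, total: int = 250) -> Dict:
    """Stand-in for the HTTP call (assumption): serves `total` fake docs."""
    start = (page - 1) * size
    stop = min(start + size, total)
    documents = [{"id": f"doc-{i}", "title": f"Title {i}"}
                 for i in range(start, stop)]
    page_meta = {"current": page, "size": size,
                 "total_pages": -(-total // size), "total_results": total}
    return {"meta": {"page": page_meta}, "results": documents}


ids = list(iter_document_ids(fake_fetch_page))
print(len(ids), ids[0], ids[-1])
```

Passing the fetch function in as a parameter keeps the paging logic separate from the HTTP details, so it can be exercised with a fake like the one here.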

Thanks for posting, I wish you an excellent end to your week.


#4

Hi goodroot, and thanks for the reply. I have now tried the documents/list endpoint, and sadly that doesn't work either. When the "current" attribute in the request goes above 100, the response always returns the 100th page. In other words, it's not possible to get more than the first 10,000 documents (100 pages with 100 documents each) with this endpoint either.


(Kellen Evan) #5

Frank --

I poked around to try and find you a better answer, but that is correct.

The limit of both current and size is 100, and the endpoint cannot be used to return more than 10,000 documents.
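
To make the gap concrete, here is a small sketch of the arithmetic, using the limits stated above and the 91035 documents reported earlier in the thread. The constant and function names are made up for illustration.

```python
MAX_CURRENT = 100  # highest page number the endpoint honours
MAX_SIZE = 100     # largest page size the endpoint accepts


def reachable(total_docs: int,
              max_current: int = MAX_CURRENT,
              max_size: int = MAX_SIZE) -> int:
    """Documents that can be listed before the paging limit is hit."""
    return min(total_docs, max_current * max_size)


total = 91035
print(reachable(total))          # 10000
print(total - reachable(total))  # 81035 documents out of reach
```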

I understand this creates a gap. The documents, and their ids, are available by querying or through the documents dashboard view, but that isn't helpful to someone looking to generate a comprehensive list.

The limit may rise in the future, but as of now it is kept restrictive. If this is a major blocker in your use-case, please email support@swiftype.com, referencing this ticket so that we can learn more.

Enjoy the week,

Kellen