How to import 10 million documents (about 5 GB) into App Search?

Hi there,
How can I import 10 million documents (about 5 GB) into App Search (self-managed)?
What is the best practice for doing this? Are there any limitations?

Thanks,
Wei.

All of our limits are noted here: https://swiftype.com/documentation/app-search/limitations#query-limits. I think these all apply to Self-Managed as well, currently, though we are working to lift some of them.

I don't believe we have any mechanisms for bulk importing that many documents. You'll probably need to script something against our Indexing API: https://swiftype.com/documentation/app-search/api/documents#create.
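A minimal sketch of what such a script might look like, assuming each Indexing API request accepts a capped number of documents (100 is used here as an assumption; check the limitations page for the actual value). `send_batch` is a hypothetical callable standing in for whatever client or HTTP call you use to post one batch; it is not part of any App Search library.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(documents: Iterable[dict], batch_size: int = 100) -> Iterator[List[dict]]:
    """Yield lists of at most batch_size documents without loading everything into memory."""
    iterator = iter(documents)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def bulk_index(
    documents: Iterable[dict],
    send_batch: Callable[[List[dict]], None],  # hypothetical: posts one batch to the Indexing API
    batch_size: int = 100,                     # assumed per-request cap
) -> int:
    """Send all documents in batches and return how many were sent."""
    total = 0
    for batch in batched(documents, batch_size):
        send_batch(batch)
        total += len(batch)
    return total
```

Because `documents` can be a generator (for example, one that reads a large file line by line), the full 5 GB never has to sit in memory at once.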

Any thoughts here @qhoxie @nickchow?

Thank you for replying. I think a bulk import feature for large numbers of documents would be a nice addition, especially for the first full data import or migration.

I agree @viphuangwei

Thanks for your comment @viphuangwei. Normally folks implement their own tooling for bulk importing, but we'll keep your feedback in mind.

Our client libraries might help, if you haven't seen those already: https://swiftype.com/documentation/app-search/getting-started#build

I have implemented my own script to upload 10K records, but I am now getting a "Rate limit exceeded" error. How do we bypass this limitation? I am using app-search-python.

Thanks.

@werewolf_ninja Unfortunately I believe the only solution to this currently is to slow down your ingestion script.

Yes, that is what I ended up doing. I think this may differ from one account to another; so far I have been able to make 5 consecutive API calls every three minutes. In this first attempt, I uploaded 5454 records in 31 minutes. At this point, the UI upload feature can handle up to 5000 records, so this API rate limit is very limiting and painful. @JasonStoltz Thank you for your prompt reply. I hope you can bring this issue to the team, and hopefully it gets prioritized :slight_smile:

@werewolf_ninja I will see if we can get this prioritized, thanks for your perseverance!

@werewolf_ninja Are you using self managed App Search?

@werewolf_ninja FWIW, you should be getting much better throughput than 5000 documents every 30 minutes. From what I can see, we should be allowing up to 3000 documents per minute, assuming you are using a paid App Search account.

You can use the following headers in the response for reference:

X-RateLimit-Limit → <number of documents allowed in 1 minute>
X-RateLimit-Remaining → <number of documents remaining this minute>

If you are rate limited and receive a 429 response, you can check the Retry-After header which will tell you how long to wait before performing a new request.
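One way to act on the 429 response in a script, as a rough sketch. It assumes `Retry-After` carries a number of seconds and falls back to a default wait if the header is missing or not numeric. `send` is a hypothetical zero-argument callable that performs one request and returns `(status_code, headers)`; with the `requests` library that could be a small wrapper returning `response.status_code` and `response.headers`.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str]]


def send_with_retry(
    send: Callable[[], Response],      # hypothetical: performs one indexing request
    max_attempts: int = 5,
    default_wait: float = 60.0,        # assumed fallback when Retry-After is unusable
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send(), waiting as instructed by Retry-After whenever a 429 comes back."""
    for attempt in range(max_attempts):
        status, headers = send()
        if status != 429:
            return status, headers
        if attempt == max_attempts - 1:
            break  # no point sleeping if we will not retry again
        try:
            wait = float(headers.get("Retry-After", default_wait))
        except ValueError:
            wait = default_wait
        sleep(wait)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

The `sleep` parameter is only there so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.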

As for self-managed App Search: no, but I am still on the trial version.
I do appreciate your response about the X-RateLimit headers; that is very helpful. I will also try configuring my script for 3000 docs per minute to see whether I still get a rate limit error. I will post my results here sometime this week. :+1:

@werewolf_ninja Great. Also, free trials are rate limited to 800 docs per minute, not 3000.
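To stay under a per-minute document budget (800 on a trial, 3000 on a paid account, per the numbers in this thread), a script could pace itself instead of waiting for a 429. The sketch below is one possible approach, not how App Search itself counts: it uses a simple client-side fixed 60-second window, which may not line up exactly with the server's window, so handling 429 responses is still worthwhile. The class name and parameters are made up for illustration.

```python
import time
from typing import Callable


class DocsPerMinuteThrottle:
    """Client-side pacing: allow at most docs_per_minute documents per 60-second window."""

    def __init__(
        self,
        docs_per_minute: int,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.limit = docs_per_minute
        self.clock = clock
        self.sleep = sleep
        self.window_start = clock()
        self.used = 0

    def acquire(self, count: int) -> None:
        """Block until count more documents fit into the current window."""
        if count > self.limit:
            raise ValueError("batch is larger than the per-minute budget")
        now = self.clock()
        if now - self.window_start >= 60:
            # The previous window has expired; start a fresh one.
            self.window_start = now
            self.used = 0
        if self.used + count > self.limit:
            # Budget exhausted: wait out the rest of the window, then reset.
            self.sleep(60 - (now - self.window_start))
            self.window_start = self.clock()
            self.used = 0
        self.used += count
```

Calling `throttle.acquire(len(batch))` before each request keeps the script within the budget. For a rough sense of scale, 10 million documents at 3000 per minute is about 3,333 minutes (roughly 56 hours), and at 800 per minute about 12,500 minutes (roughly 8.7 days).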

@JasonStoltz Thank you for the clarification about the limitation for trial accounts. That explains why I was having issues :slight_smile:
