We're looking at needing to load around 200 million records, and with the limits provided, it seems like App Search just wouldn't be reasonable. A few questions:
Does the 3,000 docs/minute limit persist even for self-hosted versions of App Search? If so, is there a way around it?
@nickchow mentioned that people often write their own tooling for bulk upload. We've written a small script that chunks and uploads data using the node.js library; effectively, our writes come down to client.indexDocuments(ENGINE_NAME, chunk);. Is there a way to turn up throughput, or another endpoint we should investigate for this amount of data?
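For reference, here's roughly what our uploader does (a trimmed sketch; the engine name, API key, and base URL are placeholders, not our actual setup):

```js
// Trimmed sketch of our chunked uploader using @elastic/app-search-node.
// ENGINE_NAME, API_KEY, and the base URL below are placeholders.
const AppSearchClient = require('@elastic/app-search-node');

const ENGINE_NAME = 'people';
const API_KEY = 'private-xxxxxxxxxxxxxxxx';
// Self-managed instance, so we point the client at our own host.
const client = new AppSearchClient(undefined, API_KEY, () => 'http://localhost:3002/api/as/v1/');

const CHUNK_SIZE = 100; // the API caps a single request at 100 documents

async function indexAll(records) {
  for (let i = 0; i < records.length; i += CHUNK_SIZE) {
    const chunk = records.slice(i, i + CHUNK_SIZE);
    await client.indexDocuments(ENGINE_NAME, chunk);
  }
}
```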
Lastly, with 200 million records, does App Search even make sense, or is that starting to hit a size where we'd be better off using Elasticsearch directly? I know this question is VERY open-ended, but I'll note that our search needs aren't super complicated; we just have a ton of data.
Is there a limit of 3,000 per minute? I thought the only limit was 100 docs per request. My bulk tool does 100 every second without error, although from time to time it fails. Maybe that's the limit kicking in, but it almost always runs smoothly.
AFAIK the 3,000 docs/minute limit only applies to App Search hosted on swiftype.com. If you're using a downloaded version of App Search or an instance on Elastic Cloud, then that limit does not apply.
That's correct; I'm indexing 5 million docs faster than that and have had no problems. However, I've never been able to finish an indexing run cleanly because I get a 503 error. What can I do? I'm thinking of splitting my files and indexing them in parts.
I'm having another problem. I have indexed 3 million documents, but only 1 million appear in my engine. Is there some delay? With the same tool, if I index fewer than a million documents, all of them appear in the engine, but if I index more than a million, it's as if I hit a limit of around 1 million. What could it be?
In the meantime, could you tell us more about your environment? What version of App Search are you using? How do you have it deployed? What OS are you on?
Sorry you are experiencing issues with bulk indexing into App Search.
In terms of scale, we have not tested the self-managed version with datasets of this size, and we would be really interested in the results of your experiment applying App Search to this problem.
To help us better reason about the scenario you're experiencing, would you be able to share some details about the scale of the infrastructure you're using for the project and the platform it all runs on?
Specific details that would help:
What version of the product is used?
What OS do you use to run it?
Do you use containers in any way? (vanilla docker or kubernetes, etc)
How many and how powerful (CPU and RAM) are your App Search instances?
How many and how powerful (CPU and RAM) are your Elasticsearch instances, and what kind of disk storage powers your ES cluster?
Another important aspect is the dataset itself:
What is your average document size?
How many fields do you have in your dataset?
Thank you for any information you could provide us and, once again, sorry you are experiencing issues with the product.
Hi Jason, maybe it was my fault. I was doing some tests indexing with multiple threads. I did a first test with a few thousand documents and 3 threads, and it worked. But then, with a 3 million document test, only the first million got indexed. Maybe the indexing doesn't accept multiple threads and they went too fast, so App Search only got the documents from the first thread.
I'm using a Python indexing tool on Windows 7.
The API supports indexing from however many threads (as long as the instance can keep up with the load, so not unlimited). When indexing those 3MM docs, did you receive successful responses for all requests (HTTP 200 responses and an empty error field in the response, as indicated in the docs)?
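For reference, a 200 status alone isn't enough, since the response body contains one entry per document. Here's a quick sketch of checking for per-document errors, using the node.js client mentioned earlier in the thread (your Python tool would do the equivalent):

```js
// Sketch: a 200 response can still carry per-document errors, so inspect
// each entry in the returned array rather than only the HTTP status.
async function indexChunkChecked(client, engineName, chunk) {
  const results = await client.indexDocuments(engineName, chunk);
  const failed = results.filter((r) => r.errors && r.errors.length > 0);
  if (failed.length > 0) {
    console.error(`${failed.length} documents failed to index:`, failed);
  }
  return failed; // collect these for a later retry
}
```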
Thanks for the response here. So far I've been running only locally with the instructions found here (App Search, Self Managed, Installation), but since the note at the bottom says:
Note : Not for production use. Authentication is disabled. App Search relies on only one Elasticsearch node.
We're inclined to invest our energy less here than in Elasticsearch, where we could have security, since that will be paramount. With that said, we haven't tried to go full big(-ish) data yet.
We have <10 fields per document. They're people records with common info like first and last names, age, city, state and a few custom fields.
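For context, each record looks roughly like this (field names are illustrative, not our exact schema):

```js
// Illustrative record shape; field names are made up for this example.
const exampleRecord = {
  id: 'person-000001',
  first_name: 'Jane',
  last_name: 'Doe',
  age: 34,
  city: 'Austin',
  state: 'TX',
  // ...plus a few custom fields
};
```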
If there's a way forward with a secure, self-hosted version of App Search, we could potentially continue experimenting but we're on a bit of a tight timeline and don't have the resources to try multiple search solutions.
Yes, empty error fields and 200 responses. But then I saw only 1/3 of the documents in the engine. Maybe it was too many threads or too fast. I tried yesterday with just one thread and all the files were indexed. It's going to take more time to index all the docs, but I feel safer using only one thread now. Maybe the documentation should be clearer on these points: how many docs per minute are supported, how many docs per request, how many threads.
For example, when I index with 100 docs per request I sometimes get a 503 error, so I decreased it to 90 docs per request to avoid the overload problem (there's a retry sketch after my edits below).
EDIT: I have now indexed many documents with two threads without problems. But with four threads, not all of the docs were indexed, and I didn't get an error message.
EDIT 2: Indexing another batch of documents with two threads, 20% of the docs were not indexed. I didn't get an error.
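For what it's worth, the direction I'm leaning is retrying failed chunks with a pause and capping how many requests are in flight. A rough sketch (in node.js to match the rest of the thread, though my tool is Python; the backoff values are guesses, not from the docs):

```js
// Sketch: retry a failed chunk with a simple linear backoff, and keep at
// most MAX_IN_FLIGHT requests running at once. Values here are guesses.
const MAX_RETRIES = 5;
const MAX_IN_FLIGHT = 2;

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function indexWithRetry(client, engineName, chunk) {
  for (let attempt = 1; attempt <= MAX_RETRIES; attempt++) {
    try {
      return await client.indexDocuments(engineName, chunk);
    } catch (err) {
      // For simplicity this retries on any error; ideally you would only
      // retry on overload responses like 503 and rethrow everything else.
      if (attempt === MAX_RETRIES) throw err;
      await sleep(1000 * attempt); // 1s, 2s, 3s, ...
    }
  }
}

async function indexAllLimited(client, engineName, chunks) {
  for (let i = 0; i < chunks.length; i += MAX_IN_FLIGHT) {
    const batch = chunks.slice(i, i + MAX_IN_FLIGHT);
    await Promise.all(batch.map((c) => indexWithRetry(client, engineName, c)));
  }
}
```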
@imjared That note is unfortunately misleading. I believe it's simply attempting to indicate that that particular docker-compose.yml file is not sufficient for production.
I think I see what is happening. I didn't lose any documents. The documents are there; they were indexed correctly. What happened is that the "total documents" count on the Engine Overview page is not counting them.
I have indexed another million docs without any error, and the "total documents" count is showing the same old number, but the docs are in the engine.
I think this is a problem with the platform.
EDIT: A few hours later, the total documents count reflected the real total.
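In case it helps anyone else, you can check the real count without waiting on the dashboard by querying the engine directly. A sketch with the node.js client used earlier in the thread (total_results may be approximate on very large engines):

```js
// Sketch: ask the engine for its total instead of trusting the dashboard's
// "total documents" counter, which can lag behind actual indexing.
async function documentCount(client, engineName) {
  const response = await client.search(engineName, '', {
    page: { size: 1 }, // we only need the result metadata, not the hits
  });
  return response.meta.page.total_results;
}
```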