It all started so well. My project involves loading 400 million documents, each an 80-character string (lines of words, mostly in English). I'm now battling through the third attempt to load the data, but every time I reach about 300 million documents I run into timeout problems, not always at exactly the same point or time of day. The fixes I've tried help, but at the cost of much worse load performance. I'm loading via helpers.bulk from Python.
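For reference, the load loop is roughly the following. This is just a minimal sketch of what I'm doing; the index name, file path, and field name are placeholders, not my real code:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("localhost:9200")

def generate_actions(path, index_name):
    # Stream one bulk action per 80-character line.
    with open(path) as f:
        for line in f:
            yield {"_index": index_name, "_source": {"text": line.rstrip("\n")}}

helpers.bulk(es, generate_actions("documents.txt", "docs"), chunk_size=5000)
```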
First attempt: 1 index, 1 shard, chunk size 5000. It loaded about 7,000 records per second until it failed completely at 320 million records (timeout) and would not load anything more.
Second attempt: 1 index, 1 shard. Repeated the previous test with no changes to see whether the failure recurred. It did, at 290 million.
Third attempt: 1 index, 4 shards. Failed at 280 million. I restarted with the timeout increased to 30 seconds; it would run for a short while and then fail again. I then reduced the chunk size to 1000, and it is now running, but performance has dropped to 500 records per second.
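Concretely, the third-attempt changes look roughly like this (again only a sketch with placeholder names; I've also assumed zero replicas since it's a single node):

```python
# Index created with 4 primary shards instead of the default 1.
es.indices.create(
    index="docs",
    body={"settings": {"number_of_shards": 4, "number_of_replicas": 0}},
)

# Bulk call with the longer per-request timeout and the smaller chunk size.
helpers.bulk(
    es,
    generate_actions("documents.txt", "docs"),
    chunk_size=1000,      # reduced from 5000
    request_timeout=30,   # raised from the client default (10 s)
)
```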
Am I hitting some scalability limit, or is there something important that I (a newbie) am missing?
It's running on a single-node dev server (Ubuntu 18.04) with 16 GB RAM and Elasticsearch 7.6.2, competing with nothing else.