It all started so well. My project involves loading 400 million documents, each an 80-character string (lines of words, mostly in English). I'm now battling through my third attempt to load the data, but every time I reach about 300 million documents I run into timeout problems, and not always at exactly the same point or time of day. The various fixes I've tried help, but at the cost of much worse load performance. I'm loading via helpers.bulk in the Python client.
First attempt: 1 index, 1 shard. Failed completely at 320 million records (timeout) and would not load anything more. Chunk size: 5000. It had loaded 7,000 records per second up to that point.
Second attempt: 1 index, 1 shard. Repeated the previous test with no changes to see if the failure recurred. It did, at 290 million.
Third attempt: 1 index, 4 shards. Failed at 280 million. Restarted with the timeout increased to 30 seconds; it ran a short while and then failed again. Reduced the chunk size to 1000, and it is now running, but performance has dropped to 500 records per second.
Am I hitting some scalability limit? Or is there something important that I (a newbie) am missing?
It's running on a single-node dev server (Ubuntu 18.04) with 16 GB of RAM and Elasticsearch 7.6.2, competing with nothing else.
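For reference, the load loop looks roughly like this (a simplified sketch; the index name, field name and file path are placeholders for my real ones):

```python
from elasticsearch import Elasticsearch, helpers
import uuid

# The 30 s timeout is from the third attempt; earlier runs used the default.
es = Elasticsearch("http://localhost:9200", timeout=30)

def actions(path, index):
    # One 80-character line per document.
    with open(path) as f:
        for line in f:
            yield {
                "_index": index,
                "_id": str(uuid.uuid4()),
                "_source": {"text": line.rstrip("\n")},
            }

# chunk_size was 5000 in the first two attempts, 1000 in the third.
helpers.bulk(es, actions("lines.txt", "docs"), chunk_size=1000)
```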
What type of hardware are you using? What type of storage do you have? Are you specifying your own document IDs? Do you have any non default settings? What is your heap size?
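If you're not sure of the heap, you can check it from the Python client with the cat nodes API (the host is a placeholder):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
# Prints one line per node with current and maximum heap.
print(es.cat.nodes(v=True, h="name,heap.current,heap.max,heap.percent"))
```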
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 60
Model name: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Storage: Elasticsearch is installed on the boot disk (which is an SSD), but the data is being written to a 12 TB hard disk; that data path is the only non-default setting. The drive is 2% full.
I am creating my own IDs in Python using the uuid library, as str(uuid.uuid4()).
As it sounds like you are using spinning disks, I recommend looking at disk I/O, disk utilisation and iowait using e.g. iostat. I would not be surprised if this is the bottleneck. As you are assigning your own document IDs, each insert has to be handled as a potential update, which requires Elasticsearch to check whether the ID already exists. This results in disk reads, which will further increase disk utilisation and slow things down.
Thanks. It seems to make sense to leave ID creation to Elasticsearch, so I'll try that next. Disk I/O seems a less fruitful place to look: if it were the bottleneck, failures could happen at any time, rather than clustering around 300m records as they do.
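Concretely, I believe the change is just to stop setting _id in the action generator, along these lines (same placeholder names as the sketch above):

```python
def actions(path, index):
    with open(path) as f:
        for line in f:
            yield {
                "_index": index,
                # previously: "_id": str(uuid.uuid4()),
                # With no _id, Elasticsearch auto-generates one, so each
                # insert is an append rather than a potential update that
                # needs an existence check.
                "_source": {"text": line.rstrip("\n")},
            }
```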
It takes a little while to check these points, but I'll report back.
Moving ID creation from the application to Elasticsearch made all the difference. I was able to load 445m documents in 8 hours. Performance varied between 12,000 and 18,000 documents per second onto a spinning drive, but the key thing is that some of the best performance came later in the run.
The video was useful for the medium term, when I get closer to productionising the application. I'll repeat the ingestion test on SSD. For me, query performance will be the main consideration.