Bulk import via node.js script

(Charlie Schliesser) #1

What are some best practices for bulk importing docs via node.js? I'm using the module found at https://www.npmjs.com/package/elasticsearch and need to import ~ 20 million docs.

I've tried importing them in batches of 100-10,000 and while some of the requests go through, many fail due to a timeout:

{ [Error: Request Timeout after 20000ms] message: 'Request Timeout after 20000ms' }

I've tried a range of maxSockets and timeout values, but none of them has made a difference. I believe the issue is that I'm making too many parallel requests, but I'm unsure how to throttle them, given that I'm parsing a very large CSV asynchronously (readline).

What settings can I tweak to get this performing better? Do I need to wait for requests to complete before I queue further requests? I was looking at another thread (Bulk inserting is slow) and believe some of it could help me, but I'm not sure how to port what's discussed there to my node.js script. If Python has existing library support that would do this in a smarter fashion, I'm happy to try that. Thanks for any help :smile:

(Konrad Beiske) #2

Hi Charlie
As usual, there is no better answer than testing, but as a rule of thumb I would try one or two concurrent batches per node in your cluster. In an asynchronous model you need to use an on-completion event to trigger sending the next batch.
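To make the on-completion idea concrete, here is a minimal sketch of a small worker pool in which each worker sends its next batch only after the previous one has finished. The names `indexBatches` and `sendBatch` are hypothetical; `sendBatch` is assumed to be any function that returns a Promise, for example a thin wrapper around your client's bulk call. The default of 2 is only a starting point for a single node.

```javascript
// Run at most `concurrency` batches at once; each completion starts the next.
// `sendBatch` is a hypothetical function that returns a Promise.
async function indexBatches(batches, sendBatch, concurrency = 2) {
  // All workers share one iterator, so every batch is sent exactly once.
  const iterator = batches[Symbol.iterator]();
  let sent = 0;

  async function worker() {
    // A worker pulls the next batch only after its previous one has completed.
    for (let next = iterator.next(); !next.done; next = iterator.next()) {
      await sendBatch(next.value);
      sent += 1;
    }
  }

  const workers = Array.from({ length: concurrency }, () => worker());
  await Promise.all(workers);
  return sent; // number of batches sent
}
```

With this shape the number of requests in flight never exceeds `concurrency`, regardless of how quickly the batches are produced.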

Beyond that, have your indexing task record the time spent preparing batches and the time spent waiting on Elasticsearch. Preparing a batch of a thousand JSON documents takes some time too, so you want to make sure your application can prepare the next batch while the current one is being indexed.
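A rough sketch of that measurement, assuming two hypothetical functions: `prepareNext`, which returns the next batch or `null` when the input is exhausted, and `sendBatch`, which returns a Promise. The helper names `createTimers` and `runPipeline` are made up for illustration. The request is started first so that preparation of the next batch overlaps with it.

```javascript
// Accumulates wall-clock time per phase, in milliseconds.
function createTimers() {
  const totals = { prepare: 0, index: 0 };
  return {
    totals,
    // Times a synchronous or Promise-returning function under `phase`.
    async time(phase, fn) {
      const start = process.hrtime.bigint();
      try {
        return await fn();
      } finally {
        totals[phase] += Number(process.hrtime.bigint() - start) / 1e6;
      }
    },
  };
}

// Sends the current batch while the next one is being prepared.
async function runPipeline(prepareNext, sendBatch, timers) {
  let current = await timers.time('prepare', prepareNext);
  while (current !== null) {
    const batch = current;
    // Start the request first, then prepare the next batch while it runs.
    const [, next] = await Promise.all([
      timers.time('index', () => sendBatch(batch)),
      timers.time('prepare', prepareNext),
    ]);
    current = next;
  }
  return timers.totals;
}
```

If `totals.prepare` turns out close to or larger than `totals.index`, the script itself is likely the bottleneck rather than the cluster. Because the two phases overlap, their sum can exceed the total elapsed time.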

Given that you're parsing a large file, remember to process it in a stream fashion so that you don't have the entire thing in memory.
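As a sketch of what streaming could look like with Node's built-in `readline`, the loop below reads the file line by line and hands over small arrays of lines. `inBatches`, `importFile` and `handleBatch` are hypothetical names, and CSV parsing of each line is left out. Because the loop awaits `handleBatch`, it does not ask for more lines until the previous batch has been dealt with.

```javascript
const fs = require('fs');
const readline = require('readline');

// Groups items from any (async) iterable into arrays of at most `size`.
async function* inBatches(source, size) {
  let batch = [];
  for await (const item of source) {
    batch.push(item);
    if (batch.length >= size) {
      yield batch;
      batch = [];
    }
  }
  if (batch.length > 0) yield batch; // final partial batch
}

// Reads `path` line by line; `handleBatch` is a hypothetical async callback.
async function importFile(path, size, handleBatch) {
  const lines = readline.createInterface({
    input: fs.createReadStream(path),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });
  let total = 0;
  for await (const batch of inBatches(lines, size)) {
    await handleBatch(batch); // no further lines are requested until this resolves
    total += batch.length;
  }
  return total; // number of lines handled
}
```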

There is no point in maximizing the batch size. Depending on document size, a few hundred or a thousand documents is usually a good number per batch. While larger batches do reduce overhead, the relative improvement decreases, and eventually the batches become too large for the receiving node.
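Since the right count depends on document size, one option is to cap batches by approximate byte size as well as by document count. The sketch below does that; `sizeCappedBatches` is a hypothetical helper, and the default limits are illustrative assumptions to be tuned by testing, not recommended values.

```javascript
// Splits docs into batches limited by document count and approximate JSON size.
// The default limits are illustrative assumptions, not recommended values.
function* sizeCappedBatches(docs, maxDocs = 1000, maxBytes = 5 * 1024 * 1024) {
  let batch = [];
  let bytes = 0;
  for (const doc of docs) {
    const docBytes = Buffer.byteLength(JSON.stringify(doc), 'utf8');
    // Close the current batch if adding this doc would exceed either limit.
    if (batch.length > 0 && (batch.length >= maxDocs || bytes + docBytes > maxBytes)) {
      yield batch;
      batch = [];
      bytes = 0;
    }
    batch.push(doc);
    bytes += docBytes;
  }
  if (batch.length > 0) yield batch; // final partial batch
}
```

A single document larger than `maxBytes` still goes out in a batch of its own rather than being dropped.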

Since you're running things on Found you can also just scale up your cluster during the batch indexing.

Best regards
Konrad Beiske
Software Engineer
