Speeding up indexing a very large file

I have about a terabyte of data I need to index on a roughly weekly basis. The data is newline-delimited JSON (one blob per line).

I have written an import script in Node.js that ingests the file and issues 40 parallel index operations against 3 Elasticsearch hosts.

There are about 14.6 million records in total, and my import job is currently running at about 12,500 records per minute. At this rate it will take about 20 hours to import the whole file. I can scan the file much faster than this (2.4 million records in 1.4 minutes), so I know the bottleneck in this process is Elasticsearch indexing.

Currently the three Elasticsearch servers are running at 60%/30%/30% CPU. So it doesn't seem like I'm saturating their capacity. I'm running the import job from the server that's at 60%, so the data isn't moving very far.

Anyone have tips I could try to speed this up? Should I perhaps try buffering the records into bulk operations?


If I could knock this from 20H down to 10H, that would be a huge win.

EDIT: I should note that I'm only indexing a small percentage of the data in the file. So I think it's more about the request volume than the data size.

I would definitely be looking at chunking those as bulk requests rather than sending each record as an individual index operation. You'll get much more bang for your buck per request. You can try running a single bulk op per node you're ingesting to, tweaking the size of those bulk requests to find the sweet spot for your setup.

@KodrAus thanks for the quick response! Hmm... I think I would want to chunk up maybe 150 MB requests or something like that. But that sounds like a reasonable way to proceed.

Try smaller bulk requests. The general recommendation is to keep them around 5MB in size.
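To make the suggestion above concrete, here is a minimal sketch of how the import script could group newline-delimited records into bulk-request bodies capped at roughly 5 MB each. The `indexName` parameter and the helper itself are hypothetical names, not from the original script; each resulting body would then be POSTed to the `_bulk` endpoint (or passed to an Elasticsearch client's bulk call), one in-flight request per node.

```javascript
// Sketch (assumptions: Node.js, a hypothetical target index name).
// Groups NDJSON records into bulk bodies of at most `maxBytes` each,
// pairing every document line with an `index` action line as the
// _bulk API expects.
function buildBulkBodies(lines, indexName, maxBytes = 5 * 1024 * 1024) {
  const bodies = [];
  let current = [];
  let size = 0;
  for (const line of lines) {
    if (!line.trim()) continue; // skip blank lines
    const action = JSON.stringify({ index: { _index: indexName } });
    // Two lines plus their trailing newlines per record.
    const entrySize = Buffer.byteLength(action) + Buffer.byteLength(line) + 2;
    if (size + entrySize > maxBytes && current.length > 0) {
      bodies.push(current.join('\n') + '\n'); // _bulk bodies end in a newline
      current = [];
      size = 0;
    }
    current.push(action, line);
    size += entrySize;
  }
  if (current.length > 0) bodies.push(current.join('\n') + '\n');
  return bodies;
}
```

Since the OP is only indexing a small percentage of each record, the filtering/projection step would happen before the lines reach this chunker, so the 5 MB budget is spent on the data actually being indexed.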


How many shards?
What version? What OS? What JVM?

You may also want to look at the documentation.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.