I have about a terabyte of data that I need to index roughly weekly. The data is newline-delimited JSON (one JSON blob per line).
I have written an import script in Node.js that ingests the file and issues 40 parallel index requests across 3 Elasticsearch hosts.
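For context, the hot loop looks roughly like this (heavily simplified; I'm assuming the v8 @elastic/elasticsearch client here, and the host names, file path, and index name are placeholders rather than my real setup):

```js
// Simplified sketch of my current importer: one index request per
// record, sent in parallel waves of 40.
const { createReadStream } = require('fs')
const split = require('split2') // splits the stream into lines
const { Client } = require('@elastic/elasticsearch')

const client = new Client({
  nodes: ['http://es1:9200', 'http://es2:9200', 'http://es3:9200']
})

async function run () {
  // Send one wave of individual index requests in parallel.
  const flush = records => Promise.all(records.map(record =>
    client.index({ index: 'my-index', document: record })
  ))

  let batch = []
  for await (const line of createReadStream('./dump.ndjson').pipe(split())) {
    batch.push(JSON.parse(line))
    if (batch.length === 40) { // 40 requests in flight at a time
      await flush(batch)
      batch = []
    }
  }
  if (batch.length > 0) await flush(batch)
}

run().catch(console.error)
```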
There are about 14.6 million records in total, and my import job is currently running at about 12,500 records per minute, so the whole file will take roughly 20 hours (14.6M ÷ 12,500/min ≈ 1,168 min ≈ 19.5 h). I can scan the file much faster than that (2.4 million records in 1.4 minutes), so I know the bottleneck is the Elasticsearch indexing, not reading the file.
Currently the three Elasticsearch servers are running at 60%/30%/30% CPU, so it doesn't look like I'm saturating their capacity. I'm running the import job from the server that's at 60%, so the data isn't traveling far.
Does anyone have tips for speeding this up? Should I try buffering the records into bulk operations?
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
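Something like this is what I'm imagining (a minimal sketch built on the bulk helper in the official @elastic/elasticsearch client; the batch size, concurrency, and the field projection are guesses on my part, not tested values):

```js
// Minimal sketch of bulk-buffered indexing with client.helpers.bulk,
// which batches documents into _bulk requests instead of making one
// HTTP call per record. Hosts, file path, index name, and the
// projected fields are placeholders.
const { createReadStream } = require('fs')
const split = require('split2')
const { Client } = require('@elastic/elasticsearch')

const client = new Client({
  nodes: ['http://es1:9200', 'http://es2:9200', 'http://es3:9200']
})

async function run () {
  const result = await client.helpers.bulk({
    // Parse each NDJSON line and keep only the fields I actually
    // index (since I only need a small slice of each record).
    datasource: createReadStream('./dump.ndjson').pipe(split(line => {
      const record = JSON.parse(line)
      return { id: record.id, title: record.title } // made-up projection
    })),
    onDocument () {
      // One index action per document; the helper handles buffering,
      // flushing, and retries.
      return { index: { _index: 'my-index' } }
    },
    flushBytes: 5 * 1024 * 1024, // ~5 MB per _bulk request
    concurrency: 4               // _bulk requests in flight at once
  })
  console.log(result) // success/failure counts and timing
}

run().catch(console.error)
```

The idea being that fewer, larger requests should cut the per-request overhead.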
If I could knock this from 20H down to 10H, that would be a huge win.
EDIT: I should note that I'm only indexing a small percentage of the data in the file, so I think this is more about request volume than data size.