I have about a terabyte of data that I need to index roughly weekly. The data is newline-delimited JSON (one JSON blob per line).
I have written an import script in Node.js that ingests the file and issues 40 parallel index requests across 3 Elasticsearch hosts.
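For context, the hot loop looks roughly like this (heavily simplified; I'm assuming the v8 @elastic/elasticsearch client here, and the host names, file path, and index name are placeholders rather than my real setup):

```js
// Simplified sketch of my current importer: one index request per
// record, sent in parallel waves of 40.
const { createReadStream } = require('fs')
const split = require('split2') // splits the stream into lines
const { Client } = require('@elastic/elasticsearch')

const client = new Client({
  nodes: ['http://es1:9200', 'http://es2:9200', 'http://es3:9200']
})

async function run () {
  // Send one wave of individual index requests in parallel.
  const flush = records => Promise.all(records.map(record =>
    client.index({ index: 'my-index', document: record })
  ))

  let batch = []
  for await (const line of createReadStream('./dump.ndjson').pipe(split())) {
    batch.push(JSON.parse(line))
    if (batch.length === 40) { // 40 requests in flight at a time
      await flush(batch)
      batch = []
    }
  }
  if (batch.length > 0) await flush(batch)
}

run().catch(console.error)
```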
There are about 14.6 million records in total, and my import job is currently running at about 12,500 records per minute, so the whole file will take roughly 20 hours (14.6M ÷ 12,500/min ≈ 1,168 min ≈ 19.5 h). I can scan the file much faster than that (2.4 million records in 1.4 minutes), so I know the bottleneck is the Elasticsearch indexing, not reading the file.
Currently the three Elasticsearch servers are running at 60%/30%/30% CPU, so it doesn't look like I'm saturating their capacity. I'm running the import job from the server that's at 60%, so the data isn't traveling far.
Does anyone have tips for speeding this up? Should I try buffering the records into bulk operations?
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
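Something like this is what I'm imagining (a minimal sketch built on the bulk helper in the official @elastic/elasticsearch client; the batch size, concurrency, and the field projection are guesses on my part, not tested values):

```js
// Minimal sketch of bulk-buffered indexing with client.helpers.bulk,
// which batches documents into _bulk requests instead of making one
// HTTP call per record. Hosts, file path, index name, and the
// projected fields are placeholders.
const { createReadStream } = require('fs')
const split = require('split2')
const { Client } = require('@elastic/elasticsearch')

const client = new Client({
  nodes: ['http://es1:9200', 'http://es2:9200', 'http://es3:9200']
})

async function run () {
  const result = await client.helpers.bulk({
    // Parse each NDJSON line and keep only the fields I actually
    // index (since I only need a small slice of each record).
    datasource: createReadStream('./dump.ndjson').pipe(split(line => {
      const record = JSON.parse(line)
      return { id: record.id, title: record.title } // made-up projection
    })),
    onDocument () {
      // One index action per document; the helper handles buffering,
      // flushing, and retries.
      return { index: { _index: 'my-index' } }
    },
    flushBytes: 5 * 1024 * 1024, // ~5 MB per _bulk request
    concurrency: 4               // _bulk requests in flight at once
  })
  console.log(result) // success/failure counts and timing
}

run().catch(console.error)
```

The idea being that fewer, larger requests should cut the per-request overhead.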
If I could knock this from 20H down to 10H, that would be a huge win.
EDIT: I should note that I'm only indexing a small percentage of the data in the file, so I think this is more about request volume than data size.