I want to import huge datasets (billions of documents) into a single-node ES 5.3 setup. Right now I'm using the latest version of ElasticDump, which uses the bulk API. Importing takes 24h+. I see no obvious reason for this:
I'm using a 24-core server. According to 'top' the load is 400-500% (4-5 cores fully used). I expect something close to 2400%.
I'm using a local SSD RAID set. According to 'iotop' the throughput is 10-40 MByte/s. I expect hundreds of MBytes/s or more (the disk array benchmarks at > 1 GByte/s).
So my server is doing next to nothing and I'm just waiting...
Things I've already tried without (significant) success:
JVM heap size = 31G (the server has a massive amount of RAM)
Increasing the number of ElasticDump sockets or documents per batch
Increasing the index refresh interval during import (see the settings sketch after this list).
Increasing the thread pool queue size.
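For reference, this is roughly how I apply the refresh-interval tweak. A minimal sketch using the elasticsearch-py client; the index name and value are just placeholders for what I'm doing, not a recommendation:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Relax the periodic refresh for the duration of the import.
# "my_index" is a placeholder; -1 disables refresh entirely,
# a large value such as "60s" also works.
es.indices.put_settings(
    index="my_index",
    body={"index": {"refresh_interval": "-1"}},
)
```

(After the import I put the setting back to something normal like "30s".)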
The only thing that speeds up the process is pretty stupid: run several ElasticDumps in parallel...
Anyone with tips 'n' tricks about how to improve the import speed by using other settings or tools? Thanks!
I expect ElasticDump is sending single _bulk requests, which are internally parallelized into one request per shard. Given that the default number of shards is 5, a load of around 5 cores is exactly what I'd expect.
Step one is to use more parallel bulk loads. If that means passing something to ElasticDump or running multiple instances, fine. I figure that is an ElasticDump concern and you are on an Elasticsearch board, so you want to know about Elasticsearch things. The usual strategy is indeed multiple _bulks in parallel.
You might also play with the size of the _bulks. I don't know what ElasticDump is doing, but it is usually good practice to aim for 10 MB or 20 MB per _bulk request and optimize up or down from there.
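To make the "multiple _bulks in parallel" and request-size advice concrete, here's a minimal sketch using the elasticsearch-py helpers rather than ElasticDump; the index name, document generator, thread count and chunk limits are assumptions you would tune for your own setup:

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

es = Elasticsearch(["localhost:9200"])

def generate_actions():
    # Yield one bulk action per document; replace with your real data source.
    for i in range(1000000):
        yield {
            "_index": "my_index",
            "_type": "doc",        # ES 5.x still uses mapping types
            "_id": i,
            "_source": {"field": "value %d" % i},
        }

# Each worker thread sends its own _bulk request; chunk_size and
# max_chunk_bytes control how many documents / bytes go into one request.
for ok, item in parallel_bulk(
    es,
    generate_actions(),
    thread_count=8,                    # number of concurrent _bulk requests
    chunk_size=5000,                   # max documents per request
    max_chunk_bytes=20 * 1024 * 1024,  # ~20 MB per request, as suggested above
):
    if not ok:
        print(item)
```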
It is also worth looking at the mappings that your documents are creating. You should set up the mapping before you import so there is no contention around mapping updates. You can reduce (or increase) the overhead of each document significantly by changing how string fields are analyzed. If you happen to have any shape fields there are good settings to play with for those too.
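As an illustration of setting the mapping up front, here's a sketch that creates the index with an explicit mapping before importing; the index name, type name and fields are made up, and the point is that nothing is left to dynamic mapping and strings that don't need full-text search stay unanalyzed:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Hypothetical index and fields; the point is to define everything up front
# so the mapping never changes during the import.
es.indices.create(
    index="my_index",
    body={
        "settings": {"number_of_shards": 24, "number_of_replicas": 0},
        "mappings": {
            "doc": {
                "dynamic": "strict",  # reject unknown fields instead of mapping them
                "properties": {
                    "id":      {"type": "keyword"},  # exact value, not analyzed
                    "message": {"type": "text"},     # analyzed full text
                    "created": {"type": "date"},
                },
            }
        },
    },
)
```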
Finally: you might get some performance benefit from indexing into more shards. Probably some. Afterwards you can use the _shrink API to reduce the number of shards so you don't pay a performance penalty at query time.
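If you do index into extra shards, the shrink step afterwards could look roughly like this. A sketch only; the index names and target shard count are examples, and _shrink requires the source index to be read-only with a target shard count that is a factor of the source's:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# _shrink requires the source index to be read-only and all of its shards on
# one node; on a single-node setup the second condition is already met.
es.indices.put_settings(
    index="my_index",
    body={"index.blocks.write": True},
)

# The target shard count must be a factor of the source shard count.
es.indices.shrink(
    index="my_index",
    target="my_index_shrunk",
    body={"settings": {"index.number_of_shards": 1,
                       "index.number_of_replicas": 0}},
)
```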
I forgot to mention: I've tested with 12 and 24 shards (one per CPU core, excluding and including Hyper-Threading), so I was expecting 1200-2400% load. Mappings, analyzers and filters are already highly optimized / stripped down.
Two questions based on your reply:
You're mentioning 10 or 20 MByte per _bulk request. Is there some way I can monitor what will work best for my specific setup? Perhaps some scripts / tools that can benchmark for the optimal values?
You're mentioning that multiple _bulks in parallel might be the way to go. Any idea why the full potential of an import is limited when using a single _bulk process? I guess that importing huge datasets into Elasticsearch is not uncommon, so I expect it to be (quite) optimized.
That is something Elasticsearch expects the importing application to handle. So I'd look at ElasticDump and see if it has logging for that.
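If you want to measure it yourself, one crude approach (a sketch, not an official tool; the document generator, sample size and chunk sizes are assumptions) is to index the same representative sample with different chunk sizes and keep whichever gives the best throughput:

```python
import time

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch(["localhost:9200"])

def sample_actions(n):
    # Replace with a representative slice of your real documents.
    for i in range(n):
        yield {"_index": "bulk_bench", "_type": "doc",
               "_source": {"field": "value %d" % i}}

# Time the same number of documents at several chunk sizes and compare.
for chunk_size in (500, 2000, 5000, 10000):
    start = time.time()
    bulk(es, sample_actions(50000), chunk_size=chunk_size)
    elapsed = time.time() - start
    print("chunk_size=%d: %.0f docs/s" % (chunk_size, 50000 / elapsed))
```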
I'm not sure. Ask the ElasticDump folks. Often when folks import a ton of data into Elasticsearch they are doing it from many nodes so they get the natural parallelization from that.
I'm not bound to ElasticDump, btw. All suggestions that speed up the import - both ES settings and other ways of importing - are welcome. The only reason I'm using ElasticDump now is that I read posts saying it's an easy-to-use and fast tool for bulk imports. If there's an ES-native way or some supermagic curl options, I'm also perfectly fine with that.
There isn't a "native" way, really. There are filebeat and logstash, both of which are things actively worked on by Elastic folks that stick files into Elasticsearch but they are both more organized around many servers loading into one Elasticsearch so they only go with a single bulk.
Elasticsearch doesn't support any kind of streaming load of the sort that would be required to make a supermagic curl command work. We've talked about it but never done it.