I want to import huge datasets (billions of documents) into a single-node ES v5.3 setup. Right now I'm using the latest version of ElasticDump, which uses the bulk API. Importing takes 24h+. I see no obvious reason for this:
- I'm using a 24-core server. According to 'top' the load is 400-500% (4-5 cores fully used). I expect something close to 2400%.
- I'm using a local SSD RAID set. According to 'iotop' the write throughput is 10-40 MByte/s. I expect hundreds of MByte/s or more (the disk array benchmarks at > 1 GByte/s).
So my server is doing next to nothing and I'm just waiting...
Things I've already tried without (significant) success:
- JVM heap size = 31G (the server has a massive amount of RAM)
- Increasing the number of ElasticDump sockets and the number of documents per batch
- Increasing the index refresh interval during import.
- Increasing the thread pool queue size.
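For context, this is roughly how I've been changing the refresh interval and replica count for the import (the index name `myindex` and the values are just examples):

```shell
# Before the bulk import: disable refresh and replicas on the target index
# ("myindex" is a placeholder for the real index name)
curl -XPUT 'localhost:9200/myindex/_settings' \
  -H 'Content-Type: application/json' -d '{
  "index": {
    "refresh_interval": "-1",
    "number_of_replicas": 0
  }
}'

# After the import: restore a normal refresh interval
curl -XPUT 'localhost:9200/myindex/_settings' \
  -H 'Content-Type: application/json' -d '{
  "index": { "refresh_interval": "1s" }
}'
```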
The only thing that speeds up the process is pretty stupid: running several ElasticDumps in parallel...
Does anyone have tips 'n' tricks for improving the import speed with other settings or tools? Thanks!