Fastest way to import billions of documents?

Thanks for your quick reply!

I forgot to mention: I've already tested with 12 and 24 shards (one per CPU core, excluding and including HyperThreading respectively), so I was expecting 1200-2400% CPU load. The mappings, analyzers, and filters are already heavily optimized and stripped down.

Two questions based on your reply:

  • You mention 10 or 20 MB per _bulk request. Is there a way to measure what works best for my specific setup? Perhaps a script or tool that can benchmark to find the optimal values? (I've sketched what I had in mind below this list.)

  • You mention that running multiple _bulk requests in parallel might be a way (or even the way) to go. Any idea why an import can't reach its full potential with a single _bulk process? I'd guess that importing huge datasets into Elasticsearch is not uncommon, so I'd expect it to be quite optimized. (See the second sketch below.)
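
To make the first question concrete, here's roughly the kind of harness I had in mind: a minimal sketch using the Python elasticsearch client, where the localhost URL, the bulk-bench index name, and the synthetic document generator are all placeholders for my real setup:

```python
import time

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")  # placeholder cluster address

def make_docs(n):
    # Placeholder documents; the real dataset would be streamed from disk.
    for i in range(n):
        yield {"_index": "bulk-bench", "_source": {"field": f"value-{i}"}}

# Index the same number of documents at several chunk sizes and compare
# throughput, to find the sweet spot for this particular cluster.
# (Delete and recreate the index between runs for a fair comparison.)
for chunk_size in (500, 1000, 5000, 10000):
    start = time.perf_counter()
    bulk(es, make_docs(50_000), chunk_size=chunk_size)
    elapsed = time.perf_counter() - start
    print(f"chunk_size={chunk_size}: {50_000 / elapsed:,.0f} docs/sec")
```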
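
And for the second question, this is how I'd try parallel indexing: parallel_bulk from the Python client's helpers fans the document stream out over a thread pool so several _bulk requests are in flight at once. The thread_count and chunk_size values here are guesses I'd tune, not recommendations:

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

es = Elasticsearch("http://localhost:9200")  # placeholder cluster address

def make_docs(n):
    # Placeholder documents, as in the sketch above.
    for i in range(n):
        yield {"_index": "bulk-bench", "_source": {"field": f"value-{i}"}}

# parallel_bulk is lazy: the generator of (ok, result) tuples must be
# consumed for any indexing to actually happen.
for ok, result in parallel_bulk(es, make_docs(1_000_000),
                                thread_count=8, chunk_size=5000):
    if not ok:
        print("failed:", result)
```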

Thanks for your thoughts!