What is the fastest way to index 1 Billion Documents into ES

For a sample document matching my requirement, it takes 250 seconds to index 1 million docs on my Mac (8GB RAM, 512GB HDD). I use the Java BulkProcessor API. Leaving hardware aside, what is the fastest way I can index 1 billion docs? Is it possible to do this in a matter of a few hours if I run it on an AWS 8XLarge machine?

Can you elaborate on why you want to leave hardware aside, and then ask explicitly about AWS hardware!?

What do the documents look like, and what configuration did you choose (settings, mappings)?

And what does "few" mean in "few hours"? It all depends.

If you want to know about AWS performance, just run your tests there yourself and find out.

Well, I mentioned AWS just as an example of a large machine. I wanted to know the best possible method, so that the indexing time depends on the hardware capability alone and not on the method of ingestion. The assumption is that the best possible way to index on my Mac should also be the best possible way to index on some other big machine. (By big machine, I also mean bigger heap space, etc.)

"Few" in "few hours" is just my way of asking whether I can index 1 billion records within a day. All the methods I have tried so far on my local machine extrapolate to a few days of indexing. I was wondering if I could bring this duration down to within a day, so that when I use a bigger machine, the duration will be even shorter.

In summary, can you tell me the fastest way to index 1 billion records, and what the best possible hardware for it is?

There are some rules, known as scaling rules. ES is great at scaling linearly; other search software is not so great.

  1. The more machines, the higher the indexing throughput. That is, more client machines and more cluster node machines. There is an edge case where the index volume exceeds the network transport capacity.
  2. Heap space has no direct relation to indexing speed; it is for processing large documents. What helps are large file system buffers and fast I/O disks.
  3. The more powerful the CPU cores, the higher the indexing throughput per node (index volume per second).
  4. Measuring documents per second is not as meaningful as measuring bytes per second.
  5. Most important rule: the faster the slowest part of the overall system is (which is the I/O subsystem), the faster ES indexing is.

So you should calculate how many bytes per time frame one billion docs amount to. Then you can measure single-node indexing (using a Mac with different specs than your real production server will not tell you the truth), and from that you can run scaling experiments that tell you the scaling factor, i.e. whether two nodes double the indexing throughput with regard to one node, whether three nodes triple it, and so on.
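For example, here is a minimal back-of-the-envelope sizing sketch. The 1 KB average document size is an assumption for illustration, not a recommendation; substitute the measured source size of your own documents:

```java
public class SizingSketch {
    public static void main(String[] args) {
        // ASSUMPTION: ~1 KB average source size per document.
        long docs = 1_000_000_000L;
        long bytesPerDoc = 1_024L;
        long totalBytes = docs * bytesPerDoc;       // ~1 TB of raw source
        long secondsPerDay = 24L * 60 * 60;         // 86,400 s
        long requiredBytesPerSec = totalBytes / secondsPerDay;
        // Prints ~11 MB/s: the sustained, cluster-wide ingest rate needed
        // to index all one billion docs within a single day.
        System.out.printf("~%d MB/s sustained to finish in one day%n",
                requiredBytesPerSec / (1024 * 1024));
    }
}
```

Once you have the measured single-node rate in bytes per second, dividing the required rate by it gives a first estimate of how many nodes to test in the scaling experiments.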

Then you can find out how many resources you need.


Thanks for the detailed explanation, Jörg Prante. I will first measure the bytes per time frame.

Also look at these guidelines for optimizing Elasticsearch indexing performance.
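One commonly recommended optimization for heavy bulk loads is to relax refresh and replication while the load runs. A rough sketch, assuming the 5.x-era TransportClient API and a placeholder index name `myindex`:

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.Settings;

public class BulkLoadSettings {

    // Call before the bulk load: with refresh disabled, segments are not
    // refreshed every second, and with zero replicas each write hits the
    // primary shard only.
    static void prepareForBulkLoad(Client client, String index) {
        client.admin().indices().prepareUpdateSettings(index)
                .setSettings(Settings.builder()
                        .put("index.refresh_interval", "-1")
                        .put("index.number_of_replicas", 0)
                        .build())
                .get();
    }

    // Call after the load completes: restore refresh and replication.
    static void restoreAfterBulkLoad(Client client, String index) {
        client.admin().indices().prepareUpdateSettings(index)
                .setSettings(Settings.builder()
                        .put("index.refresh_interval", "1s")
                        .put("index.number_of_replicas", 1)
                        .build())
                .get();
    }
}
```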

First of all, bulk processing speeds up ingestion even with the default ES settings. It seems you are already doing that.

I suggest taking a look at the BulkProcessor from ES at the link below. I use the TransportClient with the BulkProcessor.

https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/java-docs-bulk-processor.html

It allows you to control the bulk actions, bulk size, flush interval, and concurrent requests for the type of hardware you are using. I don't think there is one solution for all, at least from my experience with ES.
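For reference, a minimal sketch of that configuration, assuming the TransportClient-era API from the linked page. The concrete numbers and the `myindex`/`mytype` names are placeholders to tune for your own documents and hardware, not recommendations:

```java
import java.util.concurrent.TimeUnit;

import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue;

public class BulkIndexer {

    static BulkProcessor build(Client client) {
        return BulkProcessor.builder(client, new BulkProcessor.Listener() {
                    @Override
                    public void beforeBulk(long executionId, BulkRequest request) {
                        // e.g. log request.numberOfActions()
                    }

                    @Override
                    public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                        // e.g. check response.hasFailures()
                    }

                    @Override
                    public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                        // e.g. log the failure and retry the batch
                    }
                })
                .setBulkActions(10_000)                              // flush every 10k actions...
                .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB))  // ...or every 5 MB...
                .setFlushInterval(TimeValue.timeValueSeconds(5))     // ...or every 5 seconds
                .setConcurrentRequests(1)                            // bulks in flight beyond the one being built
                .build();
    }

    static void indexOne(BulkProcessor bulkProcessor) throws InterruptedException {
        // Placeholder index, type, and document body.
        bulkProcessor.add(new IndexRequest("myindex", "mytype")
                .source("{\"field\":\"value\"}"));
        // Flush any remaining actions and shut down.
        bulkProcessor.awaitClose(10, TimeUnit.MINUTES);
    }
}
```

Raising `setConcurrentRequests` lets several bulks be in flight at once, which usually helps throughput as long as the cluster's I/O subsystem keeps up; that is exactly the knob to vary in the scaling experiments described above.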