I am trying to determine which would be a better fit for a large bulk upload ( ~ 1 trillion items for a single index). I have tried with the http api, but its very slow and painful (it has taken a week and only inserted 112 billion items sofar). I imagine I would see a performance boost from using one of the native connectors. Which connector, Transport or Node, would give me the great performance and parallelism?
What do you expect? What is your cluster estimated throughput?
You are indexing ~1 billion docs per hour, which is a whopping 250.000 docs per second. This is fast, assuming your documents are of reasonable size (what really counts is the metric "volume per second" e.g. MB/sec) and your cluster node count is around 10-20.
So I think a run time of ~1000 hours (~42 days) for 1 trillion docs is quite reasonable.
I assume the client is not the reason for your problem. Whether node or transport, they are both equivalent from the view of API, expressiveness, and throughput- they share the same code for bulk indexing actions. Also HTTP is not much slower. HTTP adds some HTTP overhead but this can be neglected, it's around 5-10%.
Thanks for the reply. Unfortunately, 42 days is not really reasonable from a business perspective. The majority of the data being ingested is time series that will be used for aggregations. By the time the ingest was done, I'd have to do another ingest (the majority of the 2nd ingest would be more updates than new documents). I was hoping that the transport client or node client could offer more parallelism, such as pushing to individual nodes concurrently vs pushing to a single node with the http api. Are there any other settings that are tunable to allow ingestion to be on the order of 3 or days?
Indexing 1 trillion events in 3 days means an average indexing rate of almost 4 million events per second. This is a lot even for small events. Assuming that you have already followed the these guidelines, you can also try disabling replicas while indexing if you have not already done that. Indexing at that level will however require a large cluster with nodes equipped with fast disks and good amounts of CPU. Have you identified what is limiting performance so far?
Instead of indexing all the data into a single index, you may want to consider using time-based indices if you are indexing metrics. This will not necessarily improve the indexing rate, but will allow queries to efficiently target only the indices that are required when limited time periods are aggregated across.
Sure, you can push docs in bulk concurrently, on HTTP or Java API. ES supports this by the BulkProcessor class in the Java API which works with both node or transport client.
On HTTP, concurrent bulks are also supported, by the language clients.
You can not get faster by pushing to individual nodes because each indexing request will be routed internally to the target node by computing hash values. This balances the indexing over the nodes that hold shards for an index.
To achieve 3 days for a trillion, you have to estimate your cluster throughput, by adding up how many nodes you have, how many docs a node can take for indexing per time frame. Then take some measurements to find out how many push clients and receiving data server nodes you need.
Here an example of rough calculation: 1000 push nodes, 1000 data nodes, and each node reports 10.000 docs per second on bulk indexing. Cluster throughput is 10 million docs per second. This means 36 billion docs per hour and ~ 2.5 trillion in 72 hours.
Unfortunately our dev environments sofar have utilized AWS Elasticsearch service so I'm not able to adjust configurations based on the guidelines you linked to. I am in the process of spinning up a cluster on AWS but not using their elasticsearch service to get both a new version of ES and to also be able to modify configurations. I just wanted to first come with a plan of attack for the experiments. I also think I can convince the business that a week is within reason for data of this magnitude to be ingested. I will make the configuration changes described in the guidelines and test. Appreciate the help!
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.