I have been asked to index more than 3*10^12 documents into an Elasticsearch cluster. The cluster has 50 nodes, each with 40 cores and 128 GB of memory.
I was able to do it with the _bulk API from Python (multi-threaded), but I could not get past 50,000 records per second for a single node.
So I want to know:
What is the fastest way to index this data?
As I understand it, I can index through each data node; does throughput scale linearly? That is, can I get 50,000 records per second for each node?
I would recommend following the tuning guidelines in the documentation, while making sure you tune the bulk request size and distribute the indexing load across all data nodes in the cluster.
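As a minimal sketch of what that can look like with the official elasticsearch-py client (host names, the index name, and the chunk/thread settings here are placeholders you would benchmark for your own cluster):

```python
from elasticsearch import Elasticsearch, helpers

# Listing several data nodes lets the client round-robin requests across them.
es = Elasticsearch(["http://node1:9200", "http://node2:9200", "http://node3:9200"])

def actions(docs):
    # Wrap raw documents in bulk actions for the target index.
    for doc in docs:
        yield {"_index": "my-index", "_source": doc}

def index_all(docs):
    # parallel_bulk sends chunks concurrently; benchmark chunk_size
    # (a common starting range is 1,000-5,000 docs or 5-15 MB per request)
    # and thread_count rather than assuming bigger is better.
    for ok, info in helpers.parallel_bulk(es, actions(docs),
                                          thread_count=8, chunk_size=2000):
        if not ok:
            print("Failed:", info)  # inspect and retry failures as needed
```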
Not all of my 50 nodes are up yet.
I could reach 50,000 docs per second with one node; I am asking whether a cluster with 50 nodes can reach 50,000 * 50 docs per second.
You have specified RAM and the number of cores, but when indexing large volumes of data the performance of your storage and network will also play a part. What type of storage do you have? What do I/O utilisation and iowait look like while indexing? What type of network do you have in place?
The complexity and mappings of your data will also affect throughput. What is the average size of your documents? Have you optimised your mappings?
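For context, "optimised mappings" usually means defining mappings explicitly and disabling features you do not query on. A sketch using the Python client; the index name, fields, and shard count are hypothetical and would need to match your data:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://node1:9200"])

# Hypothetical index: explicit mappings avoid costly dynamic-mapping updates,
# "index": false skips building an inverted index for fields never searched,
# and disabling norms drops per-field scoring data you may not need.
es.indices.create(index="my-index", body={
    "settings": {
        "number_of_shards": 50,      # spread primaries across the data nodes
        "number_of_replicas": 0,     # add replicas after the initial load
        "refresh_interval": "30s"    # refresh less often while bulk loading
    },
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "message": {"type": "text", "norms": False},
            "timestamp": {"type": "date"},
            "payload": {"type": "keyword", "index": False}
        }
    }
})
```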
When you load data you also need to make sure you can generate enough load, so that the loading process itself does not become the bottleneck. If you are using Python, this probably means dividing up your data and distributing it across multiple loading processes and servers.
Thanks for the reply. Each doc is about 500 bytes; each node has 4 x 6 TB disks in RAID 10 (about 70 MB/s write); the mapping is optimised; the Python loader is multi-threaded and uses all cores.
Is there anything else I should do?
Can the cluster reach 50,000 * 50 docs per second?
Indexing is quite I/O intensive, so SSDs usually give much better performance than spinning disks. As a rough check: 50,000 docs/s at 500 bytes each is about 25 MB/s of raw document data per node, and replication, the translog, and segment merging can easily multiply the bytes actually written, so 70 MB/s of sequential write throughput leaves little headroom. Network throughput can also be a bottleneck, especially as the size of the cluster grows, but you have not provided any details about this.
If you are limited to gigabit networking, you may be able to get around this by placing a coordinating-only node on each host you are loading from. Such a node acts as a load balancer and routes documents directly to the appropriate data nodes, reducing the amount of data that needs to be sent or redirected between the data nodes, and therefore the network traffic.
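For reference, a coordinating-only node is just a node with every other role stripped in elasticsearch.yml; the exact setting names depend on your Elasticsearch version:

```yaml
# elasticsearch.yml for a coordinating-only node (a sketch; setting names
# vary by version). On 7.9+ an empty roles list is enough:
node.roles: []

# On older versions, disable each role explicitly instead:
# node.master: false
# node.data: false
# node.ingest: false
```

Your loading processes on that host then point their client at http://localhost:9200 and let the coordinating node fan the documents out.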
Python traditionally does not do multithreading well (the GIL limits CPU-bound threads), so I would recommend splitting the work across multiple processes, and further across multiple machines, to avoid the loading process becoming the bottleneck. This is what we do in our benchmarking tool Rally, which is also written in Python.
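A minimal sketch of the multi-process approach, assuming the input has already been split into per-worker files (the file paths, index name, and worker count are hypothetical):

```python
import json
from multiprocessing import Pool

from elasticsearch import Elasticsearch, helpers

INDEX = "my-index"
HOSTS = ["http://localhost:9200"]  # e.g. a local coordinating-only node

def load_file(path):
    # One client per worker process: each process gets its own connections
    # and its own slice of the input, so the GIL never serialises the work.
    es = Elasticsearch(HOSTS)
    with open(path) as f:
        docs = ({"_index": INDEX, "_source": json.loads(line)} for line in f)
        ok, errors = helpers.bulk(es, docs, chunk_size=2000)
    return path, ok

if __name__ == "__main__":
    files = [f"chunk-{i:04d}.json" for i in range(16)]  # pre-split input
    with Pool(processes=16) as pool:
        for path, count in pool.imap_unordered(load_file, files):
            print(f"{path}: indexed {count} docs")
```

Running copies of this loader on several machines at once is the same idea taken one level further.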