Fastest way to index huge amounts of data into Elasticsearch

I have been asked to index more than 3*10^12 documents into an Elasticsearch cluster. The cluster has 50 nodes, each with 40 cores and 128 GB of memory.
I was able to do it with the _bulk API in Python (multi-threaded), but I could not get past 50,000 records per second with a single node.
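Roughly what my loader looks like right now, as a minimal sketch (the host name, index name and document generator are placeholders):

```python
# Minimal sketch of the current loader: one client, 40 threads, helpers.bulk.
# Host name, index name and the document generator are placeholders.
from concurrent.futures import ThreadPoolExecutor
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://node1:9200"])  # currently everything goes through one node

def docs(worker_id):
    # Placeholder: yield this worker's share of the ~500-byte documents.
    for i in range(1_000_000):
        yield {"_index": "myindex", "_source": {"worker": worker_id, "seq": i}}

def load(worker_id):
    # helpers.bulk batches the generator into _bulk requests.
    helpers.bulk(es, docs(worker_id), chunk_size=5000, request_timeout=120)

with ThreadPoolExecutor(max_workers=40) as pool:
    list(pool.map(load, range(40)))
```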

So I want to know:

  1. What is the fastest way to index data?
  2. As far as I know, I can index data through each data node. Does throughput grow linearly? That is, can I expect 50,000 records per second per node across all 50 nodes?

If you have 50 nodes, why are you sending everything to one node?

I would recommend following the guidelines available in the documentation while making sure you tune the bulk size and distribute load across all data nodes in the cluster.
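As a rough sketch of what I mean (assuming the official Python client; the host names, index name and action generator are placeholders), point the client at all of the data nodes and benchmark different bulk sizes rather than relying on the defaults:

```python
# Sketch: spread bulk requests over all data nodes and tune the bulk size.
from elasticsearch import Elasticsearch, helpers

data_nodes = [f"http://node{i}:9200" for i in range(1, 51)]  # placeholder host names
es = Elasticsearch(data_nodes)  # the client round-robins requests across these hosts

def actions():
    # Placeholder: yield bulk actions for your real documents.
    yield {"_index": "myindex", "_source": {"field": "value"}}

# Benchmark different chunk_size / max_chunk_bytes / thread_count values;
# bigger bulks are not always faster.
for ok, item in helpers.parallel_bulk(es, actions(), thread_count=8,
                                      chunk_size=2000, max_chunk_bytes=10 * 1024 * 1024):
    if not ok:
        print(item)  # log failed actions
```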

Not all of my 50 nodes are up yet.
I could reach 50,000 docs per second with one node; I am asking whether the cluster with 50 nodes can reach 50,000 * 50 docs per second.

Thanks for your note. I have done all of that, and I even used parallel_bulk.
I want to know whether there is any other way to speed this up.

You have specified RAM and number of cores. When indexing large volumes of data the performance of your storage and network will also play a part. What type of storage do you have? What does I/O and iowait look like while indexing? What type of network do you have in place?

The complexity and mappings of your data will also affect throughput. What is the average size of your documents? Have you optimised mappings?
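For example, here is a minimal sketch of settings and mappings that are commonly tuned for a heavy initial load (the index and field names are placeholders, and it assumes the 7.x-style typeless mapping format):

```python
# Sketch: index created with settings/mappings geared towards bulk indexing.
# Re-enable refresh and add replicas once the initial load is done.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://node1:9200"])  # placeholder host

es.indices.create(index="myindex", body={
    "settings": {
        "number_of_shards": 50,       # placeholder: spread primaries over the data nodes
        "number_of_replicas": 0,      # add replicas after the load
        "refresh_interval": "-1"      # disable refresh while bulk loading
    },
    "mappings": {
        "dynamic": "strict",          # avoid accidental mapping growth
        "properties": {
            "timestamp": {"type": "date"},
            "payload":   {"type": "keyword", "index": False}  # kept but not searchable
        }
    }
})
```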

When you load you also need to make sure you are able to generate enough load and that the loading process does not become a bottleneck. If you are using Python this probably means dividing up your data and distributing it across multiple loading processes and servers.

Thanks for the reply. Each doc is about 500 bytes, storage is 4 x 6 TB disks in RAID 10 (about 70 MB/s write), the mapping is optimized, and the Python loader is multi-threaded using all cores.
Is there anything else I should do?
Can the cluster reach 50,000 * 50 docs per second?

Indexing is quite I/O intensive, so SSDs usually give better performance than spinning disks. Network throughput can also be a bottleneck, especially as the size of the cluster grows, but you did not provide any details about this.

If you are limited to gigabit networking, you may get around this by placing a coordinating-only node on the hosts you are loading from. It acts as a load balancer and allows data to be sent directly to the appropriate data nodes, which reduces the amount of data that needs to be forwarded between the data nodes and therefore the network traffic.
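As a minimal sketch (assuming a coordinating-only node is running locally on each loading host), the loader then simply talks to localhost instead of the data nodes:

```python
# Sketch: point the client at the local coordinating-only node; it routes each
# document to the data node that holds the target shard.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # local coordinating-only node
```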

Python traditionally does not handle multithreading well, so I would recommend splitting the work across multiple processes, and further across multiple machines, to avoid the loading process becoming the bottleneck. This is what we do in our benchmarking tool Rally, which is written in Python.
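A minimal sketch of what that could look like (the index name, slice numbering and slice reader are placeholders):

```python
# Sketch: one OS process per data slice, so the GIL does not serialise the loaders.
# Run the same script on several machines, each handling a disjoint set of slices.
from multiprocessing import Pool
from elasticsearch import Elasticsearch, helpers

def read_slice(slice_id):
    # Placeholder: read this process's share of the source data.
    yield {"slice": slice_id, "payload": "..."}

def load_slice(slice_id):
    es = Elasticsearch(["http://localhost:9200"])  # each process gets its own client
    actions = ({"_index": "myindex", "_source": doc} for doc in read_slice(slice_id))
    helpers.bulk(es, actions, chunk_size=5000, request_timeout=120)

if __name__ == "__main__":
    with Pool(processes=16) as pool:
        pool.map(load_slice, range(16))
```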

Thanks a lot, that will help with my indexing. I have found a Python library named pp, which adds really good parallel processing to Python.
