Why does the bulk API (Python) have such low efficiency when used this way?

Hi,
I am testing BULK insert (index) efficiency.
The Python code is as follows:

from elasticsearch import Elasticsearch
from elasticsearch import helpers
import time

es = Elasticsearch("127.0.0.1")

data_list = []
for i in range(50000000):
    data_list.append({"_index": "stress", "_type": "test", "_source": {
            "collectTime": 1414709176,
            "deltatime": 300,
            "deviceId": "48572",
            "getway": 0,
            "ifindiscards": 0,
            "ifindiscardspps": 0,
            # ... remaining fields elided ...
            "ifinunknownprotos": 0,
            "ifinunknownprotospps": 0
            }})
    # send a bulk request every 5000 documents, then clear the buffer
    if len(data_list) == 5000:
        helpers.bulk(es, data_list)
        data_list[:] = []

# flush any remaining documents
if len(data_list) != 0:
    helpers.bulk(es, data_list)

However, throughput is very low, about 2000 docs/s, and CPU usage is only around 300% on a 12-core machine, where I expected it to reach 1200%.

When I run esrally to benchmark my Elasticsearch, the speed reaches 9000 docs/s and CPU usage reaches 1200%.

Am I missing something?

The size of the documents and the number and types of fields will affect the indexing rate, as they determine how much work Elasticsearch needs to do for each document. The low CPU usage, however, is probably because Rally can partition the work and use multiple worker processes to bulk index in parallel, whereas your script appears to be single-threaded. Do you see better resource utilization if you run several copies of your script concurrently?
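For example, the Python client's helpers.parallel_bulk can spread bulk requests over several threads from a single script. Here is a minimal sketch, assuming a generator that yields the same kind of actions as the loop above (field list shortened, and thread_count/chunk_size values picked only for illustration):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("127.0.0.1")

def generate_actions(n):
    # yield actions lazily instead of building a 50-million-item list in memory
    for _ in range(n):
        yield {"_index": "stress", "_type": "test", "_source": {
            "collectTime": 1414709176,
            "deltatime": 300,
            "deviceId": "48572",
            # ... remaining fields as in the original script ...
        }}

# parallel_bulk fans the bulk requests out over thread_count threads;
# the returned generator must be consumed for the requests to be sent
for ok, item in helpers.parallel_bulk(es, generate_actions(50000000),
                                      thread_count=8, chunk_size=5000):
    if not ok:
        print(item)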

The problem is that when I run eight scripts, CPU usage stays at 300%-400%, and I catch the exception EsRejectedExecutionException[rejected execution (queue capacity 50)].

Don't jump directly to 8 scripts. Instead, start with 2 and increase slowly. You may also want to try a smaller bulk size. Have you run Rally with this type of event against an index with the same number of shards?

Hi, I found the solution:
use es.bulk()
do not use helpers.bulk()

es.bulk() gives much higher throughput.
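For reference, a minimal sketch of calling es.bulk() directly, assuming the same stress/test index and the elasticsearch-py 2.x-era client used in this thread; the body alternates action metadata lines and document source lines (field list shortened):

from elasticsearch import Elasticsearch

es = Elasticsearch("127.0.0.1")

body = []
for _ in range(5000):
    # action line followed by the document source line
    body.append({"index": {"_index": "stress", "_type": "test"}})
    body.append({"collectTime": 1414709176, "deltatime": 300, "deviceId": "48572"})

# es.bulk() sends the pre-built action/source pairs as a single bulk request,
# skipping the per-document wrapping that helpers.bulk() performs
response = es.bulk(body=body)
if response["errors"]:
    print("some documents failed to index")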

OK,
helpers.bulk() with chunk_size=5000 also has high efficiency (a little lower than es.bulk()).
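That is, something along these lines: hand the whole generator to helpers.bulk and let it do the chunking. A sketch, again with a shortened field list and illustrative values:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("127.0.0.1")

def generate_actions(n):
    # same shape of actions as the original script (field list shortened)
    for _ in range(n):
        yield {"_index": "stress", "_type": "test",
               "_source": {"collectTime": 1414709176, "deltatime": 300, "deviceId": "48572"}}

# chunk_size controls how many documents go into each bulk request;
# helpers.bulk returns the success count and a list of failed items
success, failed = helpers.bulk(es, generate_actions(50000000), chunk_size=5000)
print(success, failed)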

