Why does the bulk API (Python) have such low efficiency when used this way?

Hi,
I am testing BULK insert (index) efficiency.
The Python code is as follows:

from elasticsearch import Elasticsearch
from elasticsearch import helpers
import time

es = Elasticsearch("127.0.0.1")

data_list = []
for i in range(50000000):
    data_list.append({"_index": "stress", "_type": "test", "_source": {
            "collectTime": 1414709176,
            "deltatime": 300,
            "deviceId": "48572",
            "getway": 0,
            "ifindiscards": 0,
            "ifindiscardspps": 0,
            # ... remaining fields elided ...
            "ifinunknownprotos": 0,
            "ifinunknownprotospps": 0
            }})
    # send a bulk request every 5000 documents, then clear the buffer
    if len(data_list) == 5000:
        helpers.bulk(es, data_list)
        data_list[:] = []

# flush any remaining documents
if len(data_list) != 0:
    helpers.bulk(es, data_list)

However, throughput is very low, about 2000 docs/s, and CPU usage is only around 300% on a 12-core machine, where I expected it to reach 1200%.

When I run esrally to benchmark my Elasticsearch, the speed reaches 9000 docs/s and CPU usage reaches 1200%.

Am I missing something?

The size of the documents and the number and types of fields will affect the indexing rate, as they determine how much work Elasticsearch needs to do for each document. The low CPU usage, however, is probably because Rally can partition the work and use multiple worker processes to bulk index in parallel, whereas your script appears to be single-threaded. Do you see better resource utilization if you run several copies of your script concurrently?
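For example, the Python client's helpers.parallel_bulk can spread bulk requests over several threads from a single script. Here is a minimal sketch, assuming a generator that yields the same kind of actions as the loop above (field list shortened, and thread_count/chunk_size values picked only for illustration):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("127.0.0.1")

def generate_actions(n):
    # yield actions lazily instead of building a 50-million-item list in memory
    for _ in range(n):
        yield {"_index": "stress", "_type": "test", "_source": {
            "collectTime": 1414709176,
            "deltatime": 300,
            "deviceId": "48572",
            # ... remaining fields as in the original script ...
        }}

# parallel_bulk fans the bulk requests out over thread_count threads;
# the returned generator must be consumed for the requests to be sent
for ok, item in helpers.parallel_bulk(es, generate_actions(50000000),
                                      thread_count=8, chunk_size=5000):
    if not ok:
        print(item)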

The problem is that when I run eight scripts, CPU usage stays at 300%-400%, and I catch the exception EsRejectedExecutionException[rejected execution (queue capacity 50)].

Don't jump directly to 8 scripts. Instead, start with 2 and increase slowly. You may also want to try a smaller bulk size. Have you run Rally with this type of event against an index with the same number of shards?

Hi, I found the solution:
use es.bulk()
do not use helpers.bulk()

es.bulk() gives much higher throughput.
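For reference, a minimal sketch of calling es.bulk() directly, assuming the same stress/test index and the elasticsearch-py 2.x-era client used in this thread; the body alternates action metadata lines and document source lines (field list shortened):

from elasticsearch import Elasticsearch

es = Elasticsearch("127.0.0.1")

body = []
for _ in range(5000):
    # action line followed by the document source line
    body.append({"index": {"_index": "stress", "_type": "test"}})
    body.append({"collectTime": 1414709176, "deltatime": 300, "deviceId": "48572"})

# es.bulk() sends the pre-built action/source pairs as a single bulk request,
# skipping the per-document wrapping that helpers.bulk() performs
response = es.bulk(body=body)
if response["errors"]:
    print("some documents failed to index")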

OK,
helpers.bulk() with chunk_size=5000 also has high efficiency (a little lower than es.bulk()).
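That is, something along these lines: hand the whole generator to helpers.bulk and let it do the chunking. A sketch, again with a shortened field list and illustrative values:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("127.0.0.1")

def generate_actions(n):
    # same shape of actions as the original script (field list shortened)
    for _ in range(n):
        yield {"_index": "stress", "_type": "test",
               "_source": {"collectTime": 1414709176, "deltatime": 300, "deviceId": "48572"}}

# chunk_size controls how many documents go into each bulk request;
# helpers.bulk returns the success count and a list of failed items
success, failed = helpers.bulk(es, generate_actions(50000000), chunk_size=5000)
print(success, failed)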

