What is the most efficient way to insert data to Elasticsearch?


(White Rabbit) #1

I am trying to use the Python Elasticsearch client to insert data into Elasticsearch, but performance is terrible. I am on Ubuntu 14 with 164 GB RAM and 40 processors (Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz). I would like to reach a throughput at or close to 100,000 records per second. However, at the moment, after running this source code:

from datetime import datetime

from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}],
                   timeout=30, max_retries=10, retry_on_timeout=True)

print("time 0: " + str(datetime.now()))

# One HTTP request per document
for key in range(1000):
    es.index(index='messages', doc_type='message', body={
        'message': "example message",
    })

print("time 1: " + str(datetime.now()))

I get this result (about 28 seconds for 1,000 documents):

time 0: 2018-06-20 10:38:08.311971
time 1: 2018-06-20 10:38:36.154774

I also tried a version using the bulk helper:

from datetime import datetime

from elasticsearch import Elasticsearch
from elasticsearch import helpers

es = Elasticsearch([{'host': 'localhost', 'port': 9200}],
                   timeout=30, max_retries=10, retry_on_timeout=True)

# Build all bulk actions up front so only the bulk call itself is timed
actions = []
for key in range(1000):
    actions.append(
      {
        "_index": "messages",
        "_type": "message",
        "_source": {
            "message": "example message"}
      }
    )

print("time 0: " + str(datetime.now()))
helpers.bulk(es, actions)
print("time 1: " + str(datetime.now()))

but the result is still very bad (about 4 seconds for 1,000 documents):

time 0: 2018-06-20 10:51:25.667748
time 1: 2018-06-20 10:51:29.757604

Any ideas how I can improve this?


(Christian Dahlqvist) #2

What type of disk do you have? How many concurrent indexing threads/processes are you using? Which version of Elasticsearch are you using? Also have a look at this guide.


(White Rabbit) #3

cat /sys/block/sda/queue/rotational returns 0, which means SSD. At the moment I run only one thread. curl -XGET 'localhost:9200' returns "version" : { "number" : "6.2.4", ... Any ideas how to improve performance?


(Christian Dahlqvist) #4

Use multiple indexing threads. You will not be able to saturate the node using a single connection. If your test records are very small you may also want to test with a larger bulk size.
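
As a rough illustration of both suggestions, the sketch below uses `helpers.parallel_bulk` from the Python client, which sends bulk requests from a pool of threads. The function names `generate_actions` and `index_in_parallel`, the `thread_count=8` and `chunk_size=5000` values, and the localhost connection are assumptions chosen for the example, not tuned recommendations; the document shape is copied from post #1.

```python
def generate_actions(count, index="messages", doc_type="message"):
    """Yield one bulk action per test document (same shape as in post #1)."""
    for _ in range(count):
        yield {
            "_index": index,
            "_type": doc_type,  # used by the 6.x cluster in this thread
            "_source": {"message": "example message"},
        }


def index_in_parallel(count, thread_count=8, chunk_size=5000):
    """Index `count` test documents using several threads; return failure count.

    thread_count and chunk_size are illustrative starting points only.
    """
    # Imported here so generate_actions can be used without the client installed.
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch([{"host": "localhost", "port": 9200}],
                       timeout=30, max_retries=10, retry_on_timeout=True)
    failed = 0
    # parallel_bulk is a lazy generator: nothing is sent until it is consumed.
    for ok, _info in helpers.parallel_bulk(
            es,
            generate_actions(count),
            thread_count=thread_count,
            chunk_size=chunk_size,
            raise_on_error=False):
        if not ok:
            failed += 1
    return failed
```

Calling `index_in_parallel(1000000)` between two `datetime.now()` timestamps, as in post #1, and varying `thread_count` and `chunk_size` would show which combination the node handles best.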


(White Rabbit) #5

I have tested different bulk sizes. Can you recommend example source code showing the correct use of multithreading with Elasticsearch in Python?


(Christian Dahlqvist) #6

When using Python it may be better to divide the work between multiple processes.
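
One possible way to do this is sketched below with the standard `multiprocessing.Pool`: the work is split into contiguous slices and each worker process creates its own client and runs `helpers.bulk` on its slice. The helper names `split_ranges`, `index_slice`, and `index_with_processes`, the worker count, and the chunk size are assumptions made for the example; the document shape again follows post #1.

```python
from multiprocessing import Pool


def split_ranges(total, workers):
    """Split range(total) into `workers` contiguous (start, stop) slices."""
    step, extra = divmod(total, workers)
    ranges, start = [], 0
    for i in range(workers):
        # The first `extra` slices take one additional item each.
        stop = start + step + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def index_slice(bounds):
    """Runs in a worker process: bulk-index one slice with its own client."""
    # Imported and created per process; a client is not shared across processes.
    from elasticsearch import Elasticsearch, helpers

    start, stop = bounds
    es = Elasticsearch([{"host": "localhost", "port": 9200}],
                       timeout=30, max_retries=10, retry_on_timeout=True)
    actions = (
        {"_index": "messages", "_type": "message",
         "_source": {"message": "example message"}}
        for _ in range(start, stop)
    )
    # helpers.bulk returns (number of successes, list of errors).
    success, _errors = helpers.bulk(es, actions, chunk_size=5000)
    return success


def index_with_processes(total, workers=8):
    """Index `total` test documents across `workers` processes."""
    with Pool(processes=workers) as pool:
        return sum(pool.map(index_slice, split_ranges(total, workers)))
```

On platforms that spawn rather than fork worker processes, `index_with_processes` should be called from under an `if __name__ == "__main__":` guard. Separate processes sidestep the Python GIL, which is presumably why they are preferred over threads here.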

