What is the most efficient way to insert data into Elasticsearch?

I am trying to use the Python Elasticsearch client to insert data into Elasticsearch, but performance is terrible. I am running Ubuntu 14 with 164 GB RAM and 40 Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz processors. I would like to reach a throughput at or close to 100,000 records per second, but at the moment, after running this code:

from datetime import datetime

from elasticsearch import Elasticsearch

# One client with both the host and the retry settings
es = Elasticsearch([{'host': 'localhost', 'port': 9200}],
                   timeout=30, max_retries=10, retry_on_timeout=True)

print("time 0: " + str(datetime.now()))

# One HTTP request per document
for key in range(1000):
    es.index(index='messages', doc_type='message', body={
        'message': "example message",
    })

print("time 1: " + str(datetime.now()))

I get this result (about 27.8 seconds for 1000 documents):

time 0: 2018-06-20 10:38:08.311971
time 1: 2018-06-20 10:38:36.154774

I also tried a version using the bulk helper:

from datetime import datetime

from elasticsearch import Elasticsearch
from elasticsearch import helpers

# One client with both the host and the retry settings
es = Elasticsearch([{'host': 'localhost', 'port': 9200}],
                   timeout=30, max_retries=10, retry_on_timeout=True)

# Build all 1000 bulk actions up front
actions = []
for key in range(1000):
    actions.append(
      {
        "_index": "messages",
        "_type": "message",
        "_source": {
            "message": "example message"}
      }
    )

print("time 0: " + str(datetime.now()))
helpers.bulk(es, actions)
print("time 1: " + str(datetime.now()))

The result is still very bad (about 4.1 seconds for 1000 documents):

time 0: 2018-06-20 10:51:25.667748
time 1: 2018-06-20 10:51:29.757604

Any ideas how I can improve this?

What type of disk do you have? How many concurrent indexing threads/processes are you using? Which version of Elasticsearch are you using? Also have a look at this guide.

cat /sys/block/sda/queue/rotational returns 0, which means the disk is an SSD. At the moment I run only one thread. curl -XGET 'localhost:9200' returns "version" : { "number" : "6.2.4", ... Any ideas how to improve performance?

Use multiple indexing threads. You will not be able to saturate the node using a single connection. If your test records are very small you may also want to test with a larger bulk size.
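
As a rough illustration of that advice, here is a minimal sketch using helpers.parallel_bulk from the Python client, which sends bulk requests from a thread pool. The function names generate_actions and index_with_threads are hypothetical, and the thread_count and chunk_size values are only starting points to experiment with, not recommendations. It assumes a 6.x cluster on localhost:9200, as in the question.

```python
from datetime import datetime


def generate_actions(count, index="messages", doc_type="message"):
    """Yield bulk actions lazily so the full list is never held in memory."""
    for _ in range(count):
        yield {
            "_index": index,
            "_type": doc_type,  # used on 6.x, as in the question
            "_source": {"message": "example message"},
        }


def index_with_threads(count, thread_count=8, chunk_size=5000):
    """Index `count` test documents; thread_count and chunk_size are guesses to tune."""
    # Imported here so generate_actions can be used without the client installed.
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch([{"host": "localhost", "port": 9200}],
                       timeout=30, max_retries=10, retry_on_timeout=True)

    start = datetime.now()
    indexed = 0
    # parallel_bulk returns a generator: nothing is sent until it is consumed.
    for ok, _item in helpers.parallel_bulk(es, generate_actions(count),
                                           thread_count=thread_count,
                                           chunk_size=chunk_size):
        if ok:
            indexed += 1
    elapsed = (datetime.now() - start).total_seconds()
    return indexed, elapsed
```

Calling index_with_threads(100000) returns the number of indexed documents and the elapsed seconds, so different thread counts and bulk sizes can be compared on the same data.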

I have tested different bulk sizes. Can you recommend example source code showing the correct use of multithreading with Elasticsearch in Python?

When using Python it may be better to divide the work between multiple processes.
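
One way this could look, as a sketch only: split the documents into contiguous ranges and let each worker process index its range with its own client and helpers.bulk. The names split_ranges, index_range and index_with_processes are hypothetical, and the process count and chunk_size are assumptions to tune. Each process creates its own Elasticsearch client rather than sharing one across processes.

```python
from multiprocessing import Pool


def split_ranges(total, parts):
    """Split range(total) into `parts` contiguous (start, stop) pairs."""
    base, extra = divmod(total, parts)
    ranges, start = [], 0
    for i in range(parts):
        # The first `extra` ranges get one additional item each.
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def index_range(bounds):
    """Worker: index one range of test documents with its own client."""
    # Imported here so split_ranges can be used without the client installed.
    from elasticsearch import Elasticsearch, helpers

    start, stop = bounds
    es = Elasticsearch([{"host": "localhost", "port": 9200}],
                       timeout=30, max_retries=10, retry_on_timeout=True)
    actions = (
        {
            "_index": "messages",
            "_type": "message",
            "_source": {"message": "example message"},
        }
        for _ in range(start, stop)
    )
    # chunk_size is an assumed starting point, not a tuned value.
    success, _errors = helpers.bulk(es, actions, chunk_size=5000)
    return success


def index_with_processes(total, processes=8):
    """Run one bulk indexer per process and return the total indexed count."""
    with Pool(processes) as pool:
        return sum(pool.map(index_range, split_ranges(total, processes)))
```

Calling index_with_processes(100000) starts the workers. On platforms that spawn rather than fork new processes, that call should sit under an if __name__ == "__main__": guard.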
