What is the most efficient way to insert data to Elasticsearch?

white_rabbit · June 20, 2018, 9:14am

I am trying to use the Python Elasticsearch client to insert data into Elasticsearch, but performance is terrible. I am running Ubuntu 14 with 164 GB RAM and 40 processors (Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz). I would like to reach a rate of, or close to, 100000 records per second. At the moment, however, after running this source code:

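from datetime import datetime

from elasticsearch import Elasticsearch

# One client: the second assignment in the original overwrote the first,
# so host and retry settings are combined into a single constructor call.
es = Elasticsearch([{'host': 'localhost', 'port': 9200}],
                   timeout=30, max_retries=10, retry_on_timeout=True)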

print("time 0: " + str(datetime.now()))

for key in range(1000):

    es.index(index='messages', doc_type='message', body={
        'message': "example message",
    })

print("time 1: " + str(datetime.now()))

I get this result:

time 0: 2018-06-20 10:38:08.311971
time 1: 2018-06-20 10:38:36.154774

I also tried a bulk version, like this:

from datetime import datetime

from elasticsearch import Elasticsearch
from elasticsearch import helpers

# One client: host and retry settings combined into a single constructor call.
es = Elasticsearch([{'host': 'localhost', 'port': 9200}],
                   timeout=30, max_retries=10, retry_on_timeout=True)

actions = []
for key in range(1000):

    actions.append(
      {
        "_index": "messages",
        "_type": "message",
        "_source": {
            "message": "example message"}
      }
    )
print("time 0: " + str(datetime.now()))
helpers.bulk(es, actions)
print("time 1: " + str(datetime.now()))

but the result is still very bad:

time 0: 2018-06-20 10:51:25.667748
time 1: 2018-06-20 10:51:29.757604

Any ideas how I can improve this?
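
For reference, the two timings in this post can be turned into a throughput figure. The sketch below only does the arithmetic on the printed timestamps, assuming each run indexed exactly the 1000 documents from the loops shown; the helper name `docs_per_second` is made up for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"


def docs_per_second(start, end, docs=1000):
    """Return documents indexed per second between two printed timestamps."""
    elapsed = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return docs / elapsed.total_seconds()


# Timestamps copied from the two runs in the post.
single = docs_per_second("2018-06-20 10:38:08.311971", "2018-06-20 10:38:36.154774")
bulk = docs_per_second("2018-06-20 10:51:25.667748", "2018-06-20 10:51:29.757604")

print(f"one request per document: {single:.1f} docs/s")
print(f"single bulk request:      {bulk:.1f} docs/s")
```

That works out to roughly 36 documents per second for the per-document loop and roughly 245 per second for the single bulk call, both far below the 100000 per second target.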

Christian_Dahlqvist · June 20, 2018, 10:10am

What type of disk do you have? How many concurrent indexing threads/processes are you using? Which version of Elasticsearch are you using? Also have a look at this guide.

white_rabbit · June 20, 2018, 10:57am

cat /sys/block/sda/queue/rotational returns 0, which means SSD. At the moment I run only one thread. curl -XGET 'localhost:9200' returns "version" : { "number" : "6.2.4", ... Any ideas how to improve performance?

Christian_Dahlqvist · June 20, 2018, 11:00am

Use multiple indexing threads. You will not be able to saturate the node using a single connection. If your test records are very small you may also want to test with a larger bulk size.
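
A minimal sketch of that advice, not a tuned implementation: `helpers.parallel_bulk` from the Python client sends bulk requests from a thread pool, and `chunk_size` controls how many documents go into each request. The function names `generate_actions` and `index_with_threads` are made up for illustration, and the values for `thread_count` and `chunk_size` are assumptions to experiment with, not recommendations from this thread. The `_type` field matches the 6.2.4 cluster mentioned above and would need to be dropped on newer versions.

```python
def generate_actions(count, index="messages", doc_type="message"):
    """Yield bulk actions lazily so the whole batch never sits in memory."""
    for _ in range(count):
        yield {
            "_index": index,
            "_type": doc_type,  # required on 6.x, removed in later versions
            "_source": {"message": "example message"},
        }


def index_with_threads(count, thread_count=4, chunk_size=5000):
    """Index `count` test documents using several bulk threads.

    Returns the number of documents that failed to index.
    """
    # Imported here so generate_actions can be tried without the client installed.
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch([{"host": "localhost", "port": 9200}],
                       timeout=30, max_retries=10, retry_on_timeout=True)
    failed = 0
    # parallel_bulk is a generator: nothing is sent until it is consumed.
    for ok, _item in helpers.parallel_bulk(es, generate_actions(count),
                                           thread_count=thread_count,
                                           chunk_size=chunk_size,
                                           raise_on_error=False):
        if not ok:
            failed += 1
    return failed
```

Calling `index_with_threads(1000000)` against a running node and timing it with different `thread_count` and `chunk_size` values is one way to see where throughput stops improving.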

white_rabbit · June 20, 2018, 11:25am

I have tested different bulk sizes. Can you recommend example source code that uses multithreading correctly with Elasticsearch in Python?

Christian_Dahlqvist · June 20, 2018, 11:52am

When using Python it may be better to divide the work between multiple processes.
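
One way this could look, as a sketch under assumptions rather than a reference implementation: split the document range across worker processes with `multiprocessing.Pool`, and have each worker create its own client and call `helpers.bulk`. The names `split_ranges`, `index_range` and `index_in_processes` are hypothetical, as are the worker count and chunk size.

```python
from multiprocessing import Pool


def split_ranges(total, workers):
    """Split range(total) into `workers` contiguous (start, stop) pairs."""
    base, extra = divmod(total, workers)
    ranges, start = [], 0
    for i in range(workers):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def index_range(bounds):
    """Worker: index one slice of test documents; return the success count."""
    # Imported and connected inside the worker so every process gets its own
    # client instead of sharing one connection pool across processes.
    from elasticsearch import Elasticsearch, helpers

    start, stop = bounds
    es = Elasticsearch([{"host": "localhost", "port": 9200}],
                       timeout=30, max_retries=10, retry_on_timeout=True)
    actions = (
        {"_index": "messages", "_type": "message",  # _type is for 6.x only
         "_source": {"message": "example message"}}
        for _ in range(start, stop)
    )
    success, _errors = helpers.bulk(es, actions, chunk_size=5000)
    return success


def index_in_processes(total, workers=8):
    """Index `total` documents using `workers` processes; return total successes."""
    with Pool(processes=workers) as pool:
        return sum(pool.map(index_range, split_ranges(total, workers)))
```

`index_in_processes(1000000, workers=8)` should be called from under an `if __name__ == "__main__":` guard so that worker processes do not re-run it on import.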

system · July 18, 2018, 11:52am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
