Slow bulk indexing performance

TDZ · November 12, 2018, 11:01pm

I'm using ES 6.4.3 on Google Compute. I'm indexing documents where each doc has 6 fields: 4 ints and 2 floats. The index has 5 shards and a replication factor of 1. All fields are indexed (I didn't add anything to the mapping, only the type for each field) and I disabled _field_names and _all. I disabled swap on the machine and gave ES an 8GB heap with a 40% index buffer size. ES is running inside a Docker container, and the machine has 15GB RAM and a 500GB SSD. refresh_interval is 15m. CPU fluctuates between 10-30%, disk writes fluctuate between 3-12MB/s, reads are way below that, and network is at about 2MB/s. I'm using code similar to this:

from elasticsearch import helpers, Elasticsearch
import csv

es = Elasticsearch([{'host': 'remote', 'port': 9200}])

with open('/tmp/x.csv') as f:
    reader = csv.DictReader(f)
    for resp in helpers.parallel_bulk(es, reader, index='my-index', doc_type='_doc', chunk_size=10000, thread_count=4, queue_size=20):
        pass

So everything is pretty "calm" on the machine and STILL I am only able to index about 13k docs/s. Why is that?

I didn't give memory figures because they are hard to monitor: ES allocates the whole heap right off the bat. I don't think memory is the problem, though, since I also tried setting refresh_interval to -1, in which case ES writes to disk about every 10 million documents, still at the same rate (about 13k docs/s).

Christian_Dahlqvist · November 13, 2018, 7:58am

How much CPU does the loading script use? Have you tried splitting the input into multiple files and running more than one loading script in parallel?

TDZ · November 13, 2018, 8:17am

I didn't check the script itself, but it probably uses next to nothing (assume for now that it's not the problem, I'll update if it is).
I didn't try running multiple instances of the script since I'm using parallel_bulk, so it's already running a number of threads.
Plus I tried running bulk instead of parallel_bulk and got the same rate.
Also, it's not so clear from my first message, but I'm running ES on only one machine, so my "cluster" consists of 1 machine.

Christian_Dahlqvist · November 13, 2018, 8:21am

As there is nothing that jumps out as limiting performance in Elasticsearch, a sensible first step is to eliminate the loader as the bottleneck. Please look at its CPU usage and try running multiple processes in parallel to see if that makes a difference.
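One rough way to check the loader on its own is to time how fast the script can read and parse the CSV without sending anything to Elasticsearch. The sketch below is only an illustration: `measure_rows_per_second` is a hypothetical helper, and the in-memory sample stands in for the real file. If the parse rate is close to the observed indexing rate, the loader is the likely limit.

```python
import csv
import io
import time


def measure_rows_per_second(rows):
    """Consume an iterable of rows and return (count, rows_per_second)."""
    start = time.perf_counter()
    count = 0
    for _ in rows:
        count += 1
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very small inputs.
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, rate


if __name__ == "__main__":
    # Hypothetical in-memory sample standing in for /tmp/x.csv.
    sample = io.StringIO("a,b\n" + "1,2.5\n" * 100000)
    count, rate = measure_rows_per_second(csv.DictReader(sample))
    print(f"parsed {count} rows at {rate:.0f} rows/s")
```

Watching the loader process in `top` while it runs gives the complementary view: one core pinned near 100% points at the script rather than at Elasticsearch.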

TDZ · November 13, 2018, 11:47am

Turns out it was the loading script, which I didn't expect since the ES guys wrote the helpers.
Running a number of instances also helps, since even after improving the script (using streaming_bulk and so on) it's still a bit of a bottleneck.

Christian_Dahlqvist · November 13, 2018, 11:56am

Python does not do multithreading well, which is why I suspected this may be the case. Our benchmarking tool Rally is implemented in Python, but generates a number of processes to get around this limitation.
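A minimal sketch of the multi-process approach, assuming the input is split into one CSV file per worker and each process runs its own client. `split_csv` and `index_file` are hypothetical names, and the host, index name and chunk size are taken from the script earlier in the thread. This is not how Rally does it, just an illustration of the idea.

```python
import csv
import os
import tempfile
from multiprocessing import Pool


def split_csv(src_path, n_parts, out_dir):
    """Split src_path round-robin into n_parts CSV files, each with the header."""
    paths = [os.path.join(out_dir, f"part-{i}.csv") for i in range(n_parts)]
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        files = [open(p, "w", newline="") for p in paths]
        try:
            writers = [csv.writer(f) for f in files]
            for writer in writers:
                writer.writerow(header)
            for i, row in enumerate(reader):
                writers[i % n_parts].writerow(row)
        finally:
            for f in files:
                f.close()
    return paths


def index_file(path):
    """Parse and index one part file in its own process; returns docs indexed."""
    # Imported here so every worker process creates its own client.
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch([{"host": "remote", "port": 9200}])
    with open(path, newline="") as f:
        indexed, _errors = helpers.bulk(
            es, csv.DictReader(f), index="my-index", doc_type="_doc", chunk_size=10000
        )
    return indexed


if __name__ == "__main__":
    parts = split_csv("/tmp/x.csv", 4, tempfile.mkdtemp())
    with Pool(len(parts)) as pool:
        print(sum(pool.map(index_file, parts)))
```

Because each worker parses its own file, CSV parsing and request serialization are spread across processes instead of competing for a single interpreter.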

system · December 11, 2018, 11:56am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
