I'm using ES 6.4.3 on Google Compute. I'm indexing documents where each doc has 6 fields: 4 ints and 2 floats. The index has 5 shards and a replication factor of 1. All fields are indexed (I didn't add anything to the mapping beyond the type of each field) and I disabled _field_names and _all. I disabled swap on the machine and gave ES an 8GB heap with a 40% index buffer size. ES runs inside Docker, and the machine has 15GB RAM and a 500GB SSD. refresh_interval is 15m. CPU fluctuates between 10-30%, disk writes fluctuate between 3-12MB/s (reads are way below that), and network sits at about 2MB/s. I'm using code similar to this:
from elasticsearch import helpers, Elasticsearch
import csv

es = Elasticsearch(hosts=['remote'], port=9200)

with open('/tmp/x.csv') as f:
    reader = csv.DictReader(f)
    # Each CSV row becomes the source of one indexed document.
    for ok, item in helpers.parallel_bulk(es, reader, index='my-index', doc_type='_doc',
                                          chunk_size=10000, thread_count=4, queue_size=20):
        pass
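For reference, the index itself was created along these lines. This is only a sketch of the settings described above; the field names are placeholders, not the actual mapping:

# Sketch of the index setup described above (hypothetical field names).
index_body = {
    "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 1,
        "refresh_interval": "15m"
    },
    "mappings": {
        "_doc": {
            # _all is already disabled by default in 6.x.
            "_field_names": {"enabled": False},
            "properties": {
                "int_a": {"type": "integer"},
                "int_b": {"type": "integer"},
                "int_c": {"type": "integer"},
                "int_d": {"type": "integer"},
                "float_a": {"type": "float"},
                "float_b": {"type": "float"}
            }
        }
    }
}
es.indices.create(index='my-index', body=index_body)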
So everything is pretty "calm" on the machine and STILL I'm only able to index ~13k docs/s. Why is that?
I didn't include memory stats since memory is a bit hard to monitor (ES allocates the whole heap right off the bat), but I don't think it's the memory: I also tried setting refresh_interval to -1, in which case ES writes to disk about every 10 million documents, still at the same rate (~13k/s).
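For completeness, that refresh_interval experiment was just a settings change along these lines (a sketch, reusing the es client from the snippet above):

# Disable refreshes for the duration of the bulk load (sketch).
es.indices.put_settings(index='my-index', body={'index': {'refresh_interval': '-1'}})
# ... run the bulk load ...
# Restore the interval and force a refresh once the load is done.
es.indices.put_settings(index='my-index', body={'index': {'refresh_interval': '15m'}})
es.indices.refresh(index='my-index')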
I didn't profile the loading script itself, but it probably takes next to nothing (assume for now that it's not the problem; I'll update if it is).
I didn't try running multiple instances of the script, since I'm using parallel_bulk, which already runs a number of threads internally.
Plus, I tried running bulk instead of parallel_bulk: same rate.
Also, it's not so clear from my first message, but I'm running ES on only one machine, so my "cluster" consists of a single node.
As there is nothing that jumps out as limiting performance on the Elasticsearch side, a sensible first step is to eliminate the loader as the bottleneck. Please look at CPU usage on the client and try running multiple processes in parallel to see if that makes a difference.
Turns out it was the loading script, which I didn't expect since the ES guys wrote it.
Also, running a number of instances helps: even after improving the script to use streaming_bulk and such, it's still a bit of a bottleneck.
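For reference, the streaming_bulk variant looks roughly like this (a sketch based on the earlier snippet; streaming_bulk is part of elasticsearch.helpers):

from elasticsearch import helpers, Elasticsearch
import csv

es = Elasticsearch(hosts=['remote'], port=9200)

with open('/tmp/x.csv') as f:
    reader = csv.DictReader(f)
    # streaming_bulk sends one chunk at a time from a single thread.
    for ok, item in helpers.streaming_bulk(es, reader, index='my-index', doc_type='_doc',
                                           chunk_size=10000):
        if not ok:
            print(item)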
Python does not do multithreading well, which is why I suspected this may be the case. Our benchmarking tool Rally is implemented in Python, but generates a number of processes to get around this limitation.
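A minimal sketch of that approach, splitting the load across separate processes to sidestep the GIL (the pre-split file names are hypothetical):

from multiprocessing import Process
from elasticsearch import helpers, Elasticsearch
import csv

def load(path):
    # Each process creates its own client and bulk-loads one slice of the input.
    es = Elasticsearch(hosts=['remote'], port=9200)
    with open(path) as f:
        reader = csv.DictReader(f)
        for ok, item in helpers.streaming_bulk(es, reader, index='my-index', doc_type='_doc',
                                               chunk_size=10000):
            pass

if __name__ == '__main__':
    # Assumes the CSV has been pre-split into /tmp/x_0.csv ... /tmp/x_3.csv.
    procs = [Process(target=load, args=('/tmp/x_%d.csv' % i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()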