For the last two months I have been testing Elasticsearch indexing.
I have a minimum requirement to index about 50K JSON documents per second.
Each document is about 2-4 KB and has several nested fields defined.
I have the following configuration:
Elasticsearch 5.0
3 virtual machines with 32 GB RAM (16 GB as heap)
8 cores
Redhat 7.2 64 bit
Basic configuration options as recommended in the best practices, with memlock defined
I have tried the following approaches:
Indexing with the bulk API using the Python Elasticsearch library
Indexing without replicas and turning them on afterwards
Setting the number of shards to the number of cores in the cluster
Putting the data on a pool of fast SSD disks
Indexing in parallel with 5 threads sending to different nodes
Even with all of this I could not get above 10K documents/sec.
I also tried physical machines, but with no noticeable improvement.
I assumed that adding nodes would solve the issue, since Elasticsearch is a scale-out solution.
I added 3 more nodes and then another 3, yet there was no real impact on indexing speed.
My question for forum members: in your experience, does scaling out work for indexing?
Please share your examples with sizing of nodes and hardware specs.
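For readers comparing setups, the bulk-plus-threads approach listed above might look roughly like the sketch below. It is not the code used in the tests described here; it assumes the official `elasticsearch` Python client and its `helpers.parallel_bulk` function, and the host list, index name, type name and chunk size are all placeholder values.

```python
def make_actions(docs, index_name, doc_type="doc"):
    """Wrap plain dicts in the action format expected by the bulk helpers.

    index_name and doc_type are hypothetical; Elasticsearch 5.x still
    requires a type for every document.
    """
    for doc in docs:
        yield {"_index": index_name, "_type": doc_type, "_source": doc}


def index_in_parallel(hosts, docs, index_name, threads=5, chunk_size=1000):
    """Send docs through the bulk API from several threads; return docs indexed."""
    # Imported here so make_actions can be used without the client installed.
    from elasticsearch import Elasticsearch, helpers

    # Passing several hosts lets the client spread requests across the nodes.
    es = Elasticsearch(hosts)
    indexed = 0
    # parallel_bulk is lazy: the generator must be consumed for work to happen.
    for ok, _result in helpers.parallel_bulk(
        es,
        make_actions(docs, index_name),
        thread_count=threads,
        chunk_size=chunk_size,
    ):
        if ok:
            indexed += 1
    return indexed
```

A call would look like `index_in_parallel(["node1:9200", "node2:9200", "node3:9200"], docs, "test-index")`, where the host names are made up. Because Python threads share one interpreter, a single client process built this way can itself become the bottleneck, so running several such processes is worth trying.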
Hi Mark,
Usually we see indexing scale out well.
How many shards do your indices have? You should have at least as many shards as you have nodes. Also make sure that they are evenly distributed. If you have 3 nodes but configure the index to have 5 shards, then two nodes will have 2 shards each while one node has just 1. This causes hot spots. (Make sure to take replicas into account if you have any.)
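The shard arithmetic above can be checked with a small sketch. It assumes the shard copies are spread as evenly as possible across the nodes, which is a simplification of what the cluster's allocator does; the function name is made up for illustration.

```python
def shards_per_node(num_primaries, num_replicas, num_nodes):
    """Return the number of shard copies on each node under an even spread."""
    # Every primary has num_replicas additional copies.
    total = num_primaries * (1 + num_replicas)
    base, extra = divmod(total, num_nodes)
    # 'extra' nodes end up with one more shard than the others.
    return [base + 1] * extra + [base] * (num_nodes - extra)


print(shards_per_node(5, 0, 3))  # [2, 2, 1] -> uneven, one node does less work
print(shards_per_node(6, 0, 3))  # [2, 2, 2] -> even
```

If the resulting list contains different values, some nodes carry more indexing work than others; picking a shard count whose total copies divide evenly by the node count avoids that.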
Do you also see that the nodes are under pressure?
Do your clients get bulk rejections?
This would be an indicator that Elasticsearch can't keep up with the indexing. If you don't see these, then the bottleneck is your client or the network in between.
I have never seen Elasticsearch not scale. At least not at the size you are planning to run it at.
I would recommend digging some more before you continue with the suggestions below.
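One way to look for bulk rejections is the cat thread pool API, for example `GET _cat/thread_pool/bulk?format=json&h=node_name,name,active,queue,rejected`. The sketch below only summarizes rows of that shape; the sample rows and node names are invented, and in Elasticsearch 6.3 and later the pool is called `write` rather than `bulk`.

```python
def bulk_rejections_by_node(rows, pool_name="bulk"):
    """Sum the 'rejected' counter per node from _cat/thread_pool JSON rows."""
    rejected = {}
    for row in rows:
        if row.get("name") != pool_name:
            continue
        node = row["node_name"]
        # The cat API returns numbers as strings in its JSON output.
        rejected[node] = rejected.get(node, 0) + int(row["rejected"])
    return rejected


# Made-up sample shaped like the cat API response.
sample = [
    {"node_name": "node-1", "name": "bulk", "active": "8", "queue": "0", "rejected": "0"},
    {"node_name": "node-2", "name": "bulk", "active": "8", "queue": "50", "rejected": "17"},
    {"node_name": "node-2", "name": "search", "active": "0", "queue": "0", "rejected": "3"},
]
print(bulk_rejections_by_node(sample))  # {'node-1': 0, 'node-2': 17}
```

Counters that stay at zero on every node while throughput is flat would point at the client or the network rather than at the cluster.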
All of the above have downsides, so depending on what you plan to do with that data you may not want to disable them. Make sure to read the docs to decide if the tradeoff is worth it.
If you are heavily analyzing your data it can also impact your indexing performance.