Improving indexing rate with scale out in elasticsearch 5.x

(Mark) #1

Dear forum members,

For the last two months I have been testing Elasticsearch indexing.
I have a minimum requirement to index about 50K JSON documents per second.
My documents are about 2-4 KB each and have several nested fields defined.

I have the following configuration:

  • Elasticsearch v5.0
  • 3 virtual machines with 32 GB RAM (16 GB as heap)
  • 8 cores
  • Redhat 7.2 64 bit
  • Basic configuration options as in best practices with memlock defined

I have tried the following approaches:

  • Indexing with the bulk API using the Python elastic library
  • Indexing without replicas and then turning them back on
  • Setting the number of shards to the number of cores in the cluster
  • Putting the data on a fast SSD disk pool
  • Indexing in parallel with 5 separate threads to different nodes
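
For readers who want to reproduce this kind of test, here is a minimal sketch of how the steps above might be combined with the Python client: replicas off, a threaded bulk load, replicas back on. It assumes a 5.x-era elasticsearch-py client; the function names, the "doc" type name, and the chunk size are illustrative assumptions, not values from the original post.

```python
def make_actions(docs, index, doc_type="doc"):
    """Yield one bulk action per document (5.x still expects a _type)."""
    for doc in docs:
        yield {"_index": index, "_type": doc_type, "_source": doc}


def replica_settings(count):
    """Settings body that changes the replica count of an existing index."""
    return {"index": {"number_of_replicas": count}}


def load(hosts, index, docs, threads=5, chunk_size=1000, replicas_after=1):
    """Index docs with replicas disabled, then restore the replica count.

    Returns the number of documents the cluster reported as failed.
    """
    # Imported lazily so the helpers above work without the client installed.
    from elasticsearch import Elasticsearch, helpers

    # Passing several hosts lets the client rotate requests across nodes.
    client = Elasticsearch(hosts)
    client.indices.put_settings(index=index, body=replica_settings(0))
    failed = 0
    try:
        results = helpers.parallel_bulk(
            client,
            make_actions(docs, index),
            thread_count=threads,
            chunk_size=chunk_size,
            raise_on_error=False,  # report failures instead of raising
        )
        for ok, _item in results:
            if not ok:
                failed += 1
    finally:
        # Restore replicas even if the load was interrupted.
        client.indices.put_settings(
            index=index, body=replica_settings(replicas_after)
        )
    return failed
```

A call might look like load(["node1:9200", "node2:9200", "node3:9200"], "logs", docs), where the host names and index name are placeholders. Note that the threads only parallelize the HTTP requests; building the documents still happens in a single Python process, so the client itself can become the limit.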

Even with all of this, I could not get above 10K documents/sec.
I also tried physical machines, but without any noticeable improvement.

I assumed that adding nodes might solve the issue as a scale-out solution.
I added 3 more nodes and then another 3, yet there was no real impact on indexing speed.

My question for forum members:
In your experience, does scaling out work for indexing?

Please share your examples, including node sizing and hardware specs.

Thanks in advance,

(Luca Wintergerst) #2

Hi Mark,
Usually we see indexing scale out well.
How many shards do your indices have? You should have at least as many shards as you have nodes. Also make sure that they are evenly distributed. If you have 3 nodes but configure the index to have 5 shards, then two nodes will have 2 shards while one node has just 1. This causes hot spots. (Make sure to take replicas into account if you have any.)
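
The shard arithmetic above can be checked with a few lines of Python. This is only a best-case sketch that assumes shard copies are spread as evenly as possible; the real allocator weighs other factors too, and the function names are made up for illustration.

```python
def shards_per_node(nodes, primaries, replicas=0):
    """Best-case number of shard copies per node, largest first."""
    # Each replica adds one full extra copy of every primary shard.
    total = primaries * (1 + replicas)
    base, extra = divmod(total, nodes)
    # 'extra' nodes have to carry one shard more than the others.
    return [base + 1] * extra + [base] * (nodes - extra)


def is_balanced(nodes, primaries, replicas=0):
    """True when every node can hold the same number of shard copies."""
    counts = shards_per_node(nodes, primaries, replicas)
    return max(counts) == min(counts)
```

With 3 nodes and 5 primaries, shards_per_node(3, 5) gives [2, 2, 1], the uneven layout described above, while 6 primaries on 3 nodes comes out even.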
Do you also see that the nodes are under pressure?
Do your clients get bulk rejections?
This would be an indicator that elasticsearch can't keep up with the indexing. If you don't see these then the bottleneck is your client or the network in between.
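
One way to look for rejections on the server side is the nodes stats API, which reports a rejected counter per thread pool. The sketch below assumes a 5.x cluster, where the indexing pool is named "bulk", and a 5.x-era Python client; the function names are hypothetical.

```python
def bulk_rejections(stats, pool="bulk"):
    """Map node name -> rejected count from a nodes-stats response dict."""
    result = {}
    for node in stats.get("nodes", {}).values():
        pool_stats = node.get("thread_pool", {}).get(pool, {})
        result[node.get("name", "?")] = pool_stats.get("rejected", 0)
    return result


def fetch_bulk_rejections(hosts):
    """Query the cluster and return rejected bulk requests per node."""
    # Imported lazily so bulk_rejections() works without the client installed.
    from elasticsearch import Elasticsearch

    client = Elasticsearch(hosts)
    return bulk_rejections(client.nodes.stats(metric="thread_pool"))
```

If the counters stay at zero while the load is running, that supports the point above that the bottleneck is more likely the client or the network than Elasticsearch itself.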

I have never seen Elasticsearch fail to scale, at least not at the size you are planning to run it at.
I would recommend digging some more before you continue with the suggestions below.

Apart from the scaling, there are things that can speed up indexing further:

  • disable _all
  • disable _source
  • disable norms
  • change index_options

All of the above have downsides, so depending on what you plan to do with that data you may not want to disable them. Make sure to read the docs to decide if the tradeoff is worth it.
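
As a rough illustration, the four options could look like this in a 5.x index-creation body written as a Python dict. The type name "doc", the field name "message", and the choice of "freqs" for index_options are placeholder assumptions, not a recommendation.

```python
def lean_index_body(doc_type="doc", text_field="message"):
    """Index-creation body with the four indexing-speed options applied."""
    return {
        "mappings": {
            doc_type: {
                # No catch-all field is built from the other fields.
                "_all": {"enabled": False},
                # The original JSON is not stored, so it cannot be
                # returned, reindexed, or used for updates later.
                "_source": {"enabled": False},
                "properties": {
                    text_field: {
                        "type": "text",
                        # No length normalization data for scoring.
                        "norms": False,
                        # Doc ids and term frequencies only; without
                        # positions, phrase queries no longer work.
                        "index_options": "freqs",
                    }
                },
            }
        }
    }
```

With a 5.x-era Python client this body would be passed as client.indices.create(index="logs", body=lean_index_body()), where "logs" is a placeholder index name.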

If you are heavily analyzing your data, that can also impact your indexing performance.

