How does indexing performance vary as the number of nodes increases?
I plan to add more nodes to address a performance issue. I have read in a guide that adding nodes is effective for searching, but I wasn't sure about indexing. Does it become faster?
It depends on many factors.
For example, if you have only one shard, increasing the number of nodes won't change indexing throughput.
Maybe you could describe your current issue a bit? And what is your platform, in detail?
Thanks for your reply
I set up 6 or more data node instances with these settings.
number_of_shards=18
number_of_replicas=0
indices.store.throttle.max_bytes_per_sec=512mb
How about this?
In the case shown, indexing performance did not increase.
I set up 6 or more data node instances with these settings.
6 more than what?
I think these details would be instructive to someone trying to understand your system:
In the case shown, indexing performance did not increase.
And what is that performance? How are you measuring it? Are you sure the data pipeline ahead of it isn't bottlenecked?
Sorry, I did not explain in detail.
Thanks.
This could be a problem...
This isn't the problem, but you should probably reduce that to, say, 30GB, to make sure you are getting compressed ordinary object pointers.
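If the heap in question is set above roughly 32 GB, a sketch of the change (assuming the standard jvm.options / ES_HEAP_SIZE mechanism for your version):

```
# jvm.options sketch — 30g keeps the JVM below the compressed-oops threshold
-Xms30g
-Xmx30g
```

Some versions also report whether compressed oops are in use per node in the node info API (`jvm.using_compressed_ordinary_object_pointers`), which is worth checking after the change.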
Are you getting "backpressure" in Logstash? That is, are you getting bulk rejections from Elasticsearch because it's at its indexing capacity?
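One way to check for bulk rejections (a sketch; the host, port, and thread-pool name `bulk` are assumptions that depend on your version — newer releases renamed that pool to `write`):

```
# Show bulk thread-pool activity and rejections per node
curl -s 'http://localhost:9200/_cat/thread_pool/bulk?v&h=node_name,active,queue,rejected'
```

A steadily growing `rejected` count means Elasticsearch is pushing back on indexing.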
How are you invoking Logstash? (What are you providing for -w?)
What does your Elasticsearch output configuration look like?
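For reference, a minimal sketch of what such an output block might look like (all hosts, names, and values here are illustrative assumptions, not your configuration; `flush_size` applies to older Logstash versions):

```
output {
  elasticsearch {
    hosts      => ["es-node1:9200", "es-node2:9200"]  # spread bulk load across nodes
    index      => "logs-%{+YYYY.MM.dd}"
    flush_size => 5000   # events per bulk request
    workers    => 4      # parallel output workers
  }
}
```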
Again:
And what is that performance? How are you measuring it? Are you sure the data pipeline ahead of it isn't bottlenecked?
No, I have already set the Logstash batch size, workers, and Elasticsearch output flush size high enough.
The indexing speed was not as fast as I expected, and it does not seem to change with the number of nodes. Elasticsearch seems to be the bottleneck.
Thanks.
My suggestion:
Install plugins like bigdesk or kopf if you have not already.
First, look in kopf to see whether the index is spread evenly across all of the nodes; if not, try adjusting the allocation settings.
In bigdesk, look at the bulk thread graphs. See whether the bulk queue is being used and whether the number of bulk threads matches your CPU cores (or even exceeds it). You can also check there whether you are getting heavy GC collection pauses.
The last thing I would recommend is enabling DEBUG logging in ES and checking the log to see whether merging is falling behind because of the HDDs.
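On versions that support dynamic logger settings, this can be done without a restart (a sketch; the exact logger name for merge activity varies by version and is an assumption here):

```
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "logger.index.merge.scheduler": "DEBUG" }
}'
```

Set it back to INFO once you have collected enough log output, as DEBUG logging is noisy.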
How have you determined that Elasticsearch is the bottleneck? Have you replaced the Elasticsearch output in Logstash with e.g. a stdout output with the dots codec, to verify that Logstash with your current configuration is able to achieve higher throughput than Elasticsearch can handle?
Yes, you made this hypothesis clear in your introduction. The questions I've asked are an effort to discover how you arrived at that conclusion, further test your hypothesis, and explore other (frankly, more likely) reasons for the behavior you are witnessing.
How do you use that evidence to come to that conclusion? It's precisely the reverse of how one normally goes about finding a bottleneck. You've got a process with several serial components, A -> B -> C, and the process fully processes N events/second. You decide you want the process to complete 4*N events/second, so you multiply the number of C resources by 4. In response, you still get N events/second processed. The normal suspicion is then that A or B are only capable of N events/second, not that C is broken. Instead, you are guessing that C is broken or misconfigured. What's being requested is evidence for that supposition, and, in parallel, additional information about part B, a thus far (from just the evidence volunteered) more likely bottleneck.
Let's try again.
What is your indexing speed? How are you measuring it? Are you getting backpressure (bulk rejections) from Elasticsearch? What are you using for -w for Logstash? What does your output configuration look like?
As mentioned above, try using a "local" output (file or stdout) to test your Logstash throughput. Also, consider using the metrics filter to help measure that. Try these tips to optimize and test your Logstash configuration.
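A throughput-test sketch along those lines (the metrics filter and dots codec are standard Logstash pieces; the tag name and rate field shown follow the filter's behavior for `meter => "events"`):

```
filter {
  metrics {
    meter   => "events"
    add_tag => "metric"   # tag the synthetic metric events so we can route them
  }
}
output {
  if "metric" in [tags] {
    # Periodically print the 1-minute event rate
    stdout { codec => line { format => "1m rate: %{[events][rate_1m]}" } }
  } else {
    stdout { codec => dots }   # one dot per event
  }
}
```

If this local-output pipeline is not much faster than the Elasticsearch-output pipeline, the bottleneck is in Logstash or upstream, not in Elasticsearch.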
If you already have additional information that leads you to the conclusion that Elasticsearch is actually the bottleneck, please share it.
© 2020. All Rights Reserved - Elasticsearch