How does indexing performance vary over increase in number of nodes?

I plan to add more nodes to address a performance issue. I have read in a guide that adding nodes is effective for search, but I wasn't sure about indexing. Does indexing become faster?

1 Like

It depends on many factors.
For example, if you have only one shard, increasing the number of nodes won't change indexing throughput.

Maybe you could describe your current issue a bit?
And what is your platform? In detail, please.

Thanks for your reply :slight_smile:

I set up 6 or more data node instances with these settings.

number_of_shards=18
number_of_replicas=0
indices.throttle.max_bytes_per_second=512mb

How about this?

In the setup shown above, indexing performance did not increase.
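
For reference, a minimal sketch of how the shard/replica settings above might be applied at index creation time (the index name and host are assumptions, not from this thread):

```
# Hypothetical example: create an index with 18 primary shards and no replicas (ES 2.x syntax).
curl -XPUT 'http://localhost:9200/logs-test' -d '{
  "settings": {
    "number_of_shards": 18,
    "number_of_replicas": 0
  }
}'
```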

I set up 6 or more data node instances with these settings.

6 more than what?

I think these details will be instructive to someone trying to understand your system (a few of them can be read straight from the cluster; see the sketch after this list):

  • version of ES
  • how many client/data nodes before and after
  • how many CPU cores per node
  • how many indices
  • how many shards per index
  • what kind of storage on data nodes
  • how much RAM per node, how much heap
  • do you have paging/swapping disabled
  • what is your indexing client, and how is it configured (posting all to one node, or all nodes, or ?)
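
For example, some of these numbers can be pulled directly from the cluster with the _cat APIs (host and port here are assumptions; adjust to your setup):

```
# Hypothetical commands against a local node.
curl 'localhost:9200/'                                                 # ES version
curl 'localhost:9200/_cat/nodes?v&h=host,node.role,heap.max,ram.max'   # node count, heap, RAM
curl 'localhost:9200/_cat/indices?v'                                   # indices and shards per index
```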

In the setup shown above, indexing performance did not increase.

And what is that performance? How are you measuring it? Are you sure the data pipeline ahead of it isn't bottlenecked?

Sorry, I did not explain in detail :frowning:

  • version of ES
    : 2.3.4
  • how many client/data nodes before and after
    : started with 1 client node and 1 data node, then sequentially increased the number of data nodes to 6.
  • how many CPU cores per node
    : 32 cores with hyper-threading
  • how many indices
    : 1 index.
  • how many shards per index
    : 18 shards
  • what kind of storage on data nodes
    : SAS HDD
  • how much RAM per node, how much heap
    : 64GB of RAM per node and a 32GB heap.
  • do you have paging/swapping disabled
    : disabled
  • what is your indexing client, and how is it configured (posting all to one node, or all nodes, or ?)
    : a single Logstash instance -> 1 client node on the same host -> data nodes on other hosts.

Thanks. :smiley:

This could be a problem...

This isn't the problem, but you should probably reduce that to, say, 30GB, to make sure you are getting compressed ordinary object pointers.
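
If you want to verify the cutoff on your own JVM, one rough check is the following (the java binary and the exact sizes are assumptions; the threshold varies by JVM build):

```
# Typically reports UseCompressedOops = false at 32g and true at around 30g.
java -Xmx32g -XX:+PrintFlagsFinal -version | grep -i usecompressedoops
java -Xmx30g -XX:+PrintFlagsFinal -version | grep -i usecompressedoops
```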

Are you getting "backpressure" in Logstash? That is, are you getting bulk rejections from Elasticsearch because it's at its indexing capacity?
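
One way to check for such rejections from the Elasticsearch side is the _cat thread pool API (host/port assumed):

```
# A non-zero bulk.rejected count means Elasticsearch is pushing back on indexing requests.
curl 'localhost:9200/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected'
```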

How are you invoking Logstash? (what are you providing for -w?)
What does your Elasticsearch output configuration look like?
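
For comparison, here is a minimal sketch of the kind of configuration being asked about; the file name, host, index name, and numbers are assumptions, not taken from this thread:

```
# Hypothetical pipeline -- not the poster's actual configuration.
cat > pipeline.conf <<'EOF'
input  { stdin { } }
output {
  elasticsearch {
    hosts      => ["client-node:9200"]
    index      => "logs-test"
    flush_size => 5000   # events per bulk request
    workers    => 4      # parallel bulk senders from this output
  }
}
EOF

bin/logstash -w 8 -f pipeline.conf   # -w sets the number of pipeline workers
```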

Again:

And what is that performance? How are you measuring it? Are you sure the data pipeline ahead of it isn't bottlenecked?

1 Like

No, I have already set the Logstash batch size, the number of workers, and the Elasticsearch output flush size high enough.

The indexing speed was not as fast as I expected, and it does not seem to change with the number of nodes. Elasticsearch seems to be the bottleneck.

Thanks.

My suggestion:

Install plugins like bigdesk or kopf, if you have not already.

First, check in kopf whether the index is spread evenly across all of the nodes. If not, try to adjust the allocation settings.

In bigdesk you should look at the bulk thread graphs. See whether the bulk queue is being used and whether the number of bulk threads matches (or exceeds) your CPU cores. You can also check there whether you are getting heavy GC collection pauses.

The last thing I would recommend is enabling DEBUG logging in ES and checking the logs to see whether merging is falling behind because of the HDDs.
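
If you prefer the REST APIs to plugins, roughly the same information is available there (host/port assumed):

```
# Is the index spread evenly across the data nodes?
curl 'localhost:9200/_cat/shards?v'
curl 'localhost:9200/_cat/allocation?v'

# Per-node GC counts and collection times (look for long or frequent old-gen collections).
curl 'localhost:9200/_nodes/stats/jvm?pretty'
```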

How have you determined that Elasticsearch is the bottleneck? Have you replaced the Elasticsearch output in Logstash with e.g. a stdout output with the dots codec, to verify that Logstash with your current configuration is able to achieve higher throughput than Elasticsearch can handle?
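
A minimal sketch of that kind of test, assuming a file of sample events and the pv utility (names and paths are hypothetical):

```
# Keep the real inputs/filters, but replace the output with stdout + the dots codec
# (one dot per event), then measure throughput with pv.
cat > throughput-test.conf <<'EOF'
input  { stdin { } }                  # stand-in for the real inputs/filters
output { stdout { codec => dots } }
EOF

bin/logstash -w 8 -f throughput-test.conf < sample-events.log | pv -abt > /dev/null
```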

1 Like

Yes, you made this hypothesis clear in your introduction. The questions I've asked are an effort to discover how you arrived at that conclusion, further test your hypothesis, and explore other (frankly, more likely) reasons for the behavior you are witnessing.

How do you use that evidence to come to that conclusion? It's precisely the reverse of how one normally goes about finding a bottleneck. You've got a process with several serial components, A -> B -> C, and the process fully processes N events/second. You decide you want the process to complete 4*N events/second, so you multiply the number of C resources by 4. In response, you still get N events/second processed. The normal suspicion is then that A or B are only capable of N events/second, not that C is broken. Instead, you are guessing that C is broken or misconfigured. What's being requested is evidence for that supposition, and, in parallel, additional information about part B, a thus far (from just the evidence volunteered) more likely bottleneck.

Let's try again.

What is your indexing speed? How are you measuring it? Are you getting backpressure (bulk rejections) from Elasticsearch? What are you using for -w for Logstash? What does your output configuration look like?

As mentioned above, try using a "local" output (file or stdout) to test your Logstash throughput. Also, consider using the metrics filter to help measure that. Try these tips to optimize and test your Logstash configuration.
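
A rough sketch of that metrics filter approach (the meter name and the 1-minute rate field follow the filter's documented usage; everything else is hypothetical):

```
# Hypothetical config: print a 1-minute event rate while the pipeline runs.
cat > metrics-test.conf <<'EOF'
input  { stdin { } }
filter {
  metrics {
    meter   => "events"
    add_tag => "metric"
  }
}
output {
  if "metric" in [tags] {
    stdout { codec => line { format => "1m rate: %{[events][rate_1m]}" } }
  }
}
EOF
```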

If you already have additional information that leads you to the conclusion that Elasticsearch is actually the bottleneck, please share it.

3 Likes