How does indexing performance vary as the number of nodes increases?
I plan to add more nodes to address a performance issue. I have read in a guide that adding nodes is effective for searching, but I wasn't sure about indexing. Does it become faster?
It depends on many factors.
For example, if you have only one shard, increasing the number of nodes won't change indexing throughput.
Maybe you could describe your current issue a bit? And what is your platform, in detail?
Thanks for your reply
I set up 6 or more data node instances with these settings.
number_of_shards=18
number_of_replicas=0
indices.store.throttle.max_bytes_per_sec=512mb
How about this?
In the case shown, indexing performance did not increase.
I set up 6 or more data node instances with these settings.
6 more than what?
I think these details would be instructive to someone trying to understand your system:
In the case shown, indexing performance did not increase.
And what is that performance? How are you measuring it? Are you sure the data pipeline ahead of it isn't bottlenecked?
Sorry, I did not explain in detail.
Thanks.
This could be a problem...
This isn't the problem, but you should probably reduce that to, say, 30GB, to make sure you are getting compressed ordinary object pointers.
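If the heap in question is set above roughly 32 GB, a sketch of the change (assuming the standard jvm.options / ES_HEAP_SIZE mechanism for your version):

```
# jvm.options sketch — 30g keeps the JVM below the compressed-oops threshold
-Xms30g
-Xmx30g
```

Some versions also report whether compressed oops are in use per node in the node info API (`jvm.using_compressed_ordinary_object_pointers`), which is worth checking after the change.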
Are you getting "backpressure" in Logstash? That is, are you getting bulk rejections from Elasticsearch because it's at its indexing capacity?
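One way to check for bulk rejections (a sketch; the host, port, and thread-pool name `bulk` are assumptions that depend on your version — newer releases renamed that pool to `write`):

```
# Show bulk thread-pool activity and rejections per node
curl -s 'http://localhost:9200/_cat/thread_pool/bulk?v&h=node_name,active,queue,rejected'
```

A steadily growing `rejected` count means Elasticsearch is pushing back on indexing.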
How are you invoking Logstash? (What are you providing for -w?)
What does your Elasticsearch output configuration look like?
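For reference, a minimal sketch of what such an output block might look like (all hosts, names, and values here are illustrative assumptions, not your configuration; `flush_size` applies to older Logstash versions):

```
output {
  elasticsearch {
    hosts      => ["es-node1:9200", "es-node2:9200"]  # spread bulk load across nodes
    index      => "logs-%{+YYYY.MM.dd}"
    flush_size => 5000   # events per bulk request
    workers    => 4      # parallel output workers
  }
}
```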
Again:
And what is that performance? How are you measuring it? Are you sure the data pipeline ahead of it isn't bottlenecked?
No, I have already set the Logstash batch size, workers, and Elasticsearch output flush size high enough.
The indexing speed was not as fast as I expected, and it does not seem to change with the number of nodes. Elasticsearch seems to be the bottleneck.
Thanks.
My suggestion:
Install plugins like bigdesk or kopf if you have not already.
First, look in kopf to see whether the index is spread evenly across all of the nodes; if not, try adjusting the allocation settings.
In bigdesk, look at the bulk thread graphs. See whether the bulk queue is being used and whether the number of bulk threads matches your CPU cores (or even exceeds it). You can also check there whether you are getting heavy GC collection pauses.
The last thing I would recommend is enabling DEBUG logging in ES and checking the log to see whether merging is falling behind because of the HDDs.
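On versions that support dynamic logger settings, this can be done without a restart (a sketch; the exact logger name for merge activity varies by version and is an assumption here):

```
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "logger.index.merge.scheduler": "DEBUG" }
}'
```

Set it back to INFO once you have collected enough log output, as DEBUG logging is noisy.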
How have you determined that Elasticsearch is the bottleneck? Have you replaced the Elasticsearch output in Logstash with e.g. a stdout output with the dots codec, to verify that Logstash with your current configuration is able to achieve higher throughput than Elasticsearch can handle?
Yes, you made this hypothesis clear in your introduction. The questions I've asked are an effort to discover how you arrived at that conclusion, further test your hypothesis, and explore other (frankly, more likely) reasons for the behavior you are witnessing.
How do you use that evidence to come to that conclusion? It's precisely the reverse of how one normally goes about finding a bottleneck. You've got a process with several serial components, A -> B -> C, and the process fully processes N events/second. You decide you want the process to complete 4*N events/second, so you multiply the number of C resources by 4. In response, you still get N events/second processed. The normal suspicion is then that A or B are only capable of N events/second, not that C is broken. Instead, you are guessing that C is broken or misconfigured. What's being requested is evidence for that supposition, and, in parallel, additional information about part B, a thus far (from just the evidence volunteered) more likely bottleneck.
Let's try again.
What is your indexing speed? How are you measuring it? Are you getting backpressure (bulk rejections) from Elasticsearch? What are you using for -w for Logstash? What does your output configuration look like?
As mentioned above, try using a "local" output (file or stdout) to test your Logstash throughput. Also, consider using the metrics filter to help measure that. Try these tips to optimize and test your Logstash configuration.
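A throughput-test sketch along those lines (the metrics filter and dots codec are standard Logstash pieces; the tag name and rate field shown follow the filter's behavior for `meter => "events"`):

```
filter {
  metrics {
    meter   => "events"
    add_tag => "metric"   # tag the synthetic metric events so we can route them
  }
}
output {
  if "metric" in [tags] {
    # Periodically print the 1-minute event rate
    stdout { codec => line { format => "1m rate: %{[events][rate_1m]}" } }
  } else {
    stdout { codec => dots }   # one dot per event
  }
}
```

If this local-output pipeline is not much faster than the Elasticsearch-output pipeline, the bottleneck is in Logstash or upstream, not in Elasticsearch.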
If you already have additional information that leads you to the conclusion that Elasticsearch is actually the bottleneck, please share it.
© 2020. All Rights Reserved - Elasticsearch