Kafka optimal configuration


(Allen Chan) #1

I am trying to use Kafka to replace redis.
I am curious if anyone can help with this question

I have three brokers that would have 4 topics. Each topic would have dedicated three logstash indexers.

My question should the topics have 3 partitions and each indexer goes to 1 partition with 1 consumer_thread? or should expand it a little bit. IE topic has 6 partitions (2 per broker) and each indexer has 2 consumer_threads?

i know the total # of consumer_threads should equal total partitions of a topic but dont know what scenario leads to higher throughput.

Any guidance would be great


(Joe Lawson) #2

Your primary bottleneck is likely going to be Elasticsearch. Also, it would
help if you let us know 1) cloud or data center and 2) cpu and memory
sizes. I am assuming you are in a data center. With that said, lets
consider the questions. First, always leave the logstash-input-kafka
consumer_threads at 1. That option should be removed and I keep saying it
here but don't do a PR, so apologies for that. The short explanation here
is that Kafka, in the background, creates a consumer thread for each
partition it has assigned. Each consumer thread pushes its messages in the
internal buffer into Logstash's processing chain (input>filter>output).
Because each consumer thread has it's own collection of threads there is no
need to add the additional "consumer_threads" as threading is already
happening.

That aside, the scenarios we are looking at are three partitions with three
consumers or six partitions with three consumers. The primary reason to
have more partitions than the maximum number of consumers you are running
is to allow more throughput later if you added more consumers. For example
if you ran six logstash servers each with a logstash-input-kafka consumer
instead of just three. Raising the number of partitions beyond three when
you are only running with three consumers will likely not give you much of
a benefit from the consumption perspective, in fact it'll probably slow
things down as there will be parallel context switches going on in the
background. Also consider that Kafka is just streaming offsets from the
brokers so breaking up partitions can causes more disk IO thrashing if the
events are not in cache.

Please keep in mind that "slowing down" is in PC speeds, not human so the
impact could be negligible. Also if your Kafka boxes are running SSDs and a
ton of memory, more partitions isn't going to hurt much and can give you
headroom for the future. Consider scanarios where many systems are
producing messages for Kafka, having more partitions can spread the load
across the brokers.

One scenario where you may want to have more partitions is if your logstash
server has multiple CPU cores. You would probably want 1 to 2 cores for
logstash and then match the rest for kafka. IE if each box is a 4 core, you
may want to consider a setup of 2x partitions per logstash instance. So six
partitions with three four core machines. you would have two cores handling
all the kafka threads and two filtering and talking to Elastchsearch.

If you are on the cloud, I generally suggest using 1 CPU machines and then
set your logstash autoscaling group to scale on CPU consumption with a
minimum of 1 and a max equal to the number of partitions. Then your setup
scales on load. When peak is hitting more instances will fire, etc.

Hope this helps!

Sincerely,

Joe Lawson


(system) #3