Kafka ingest performance issues

ethrbunny · August 3, 2019, 3:24pm

I'm sending all my beat* data into Kafka and reading it from there via Logstash (running on Kubernetes). I use Kafka for lots of other purposes and I know we can read 1M records a second using a simple consumer. When I look at the stats for my cluster it appears we're getting ~50K records/sec.

I suspect I have something misconfigured in my logstash setup. Not sure where to look to find tips/tricks for optimizing this path.

System: all 7.3.0
5 data nodes running on kubernetes hosts with 56 cores/64G RAM
5 ingest nodes
3 master nodes

file/metric beat kafka topics have 20 partitions and 20 consumers for each - also running in kubernetes.

Christian_Dahlqvist · August 3, 2019, 4:42pm

Have you verified that Elasticsearch is not the bottleneck? What type of storage do your data nodes have?

ethrbunny · August 4, 2019, 5:03pm

Data nodes storage is from a SAN via 10G network.

How would I (dis)prove that elastic is the limiting factor?

Additional: yesterday I split out some of the higher traffic items from file/metric beat into their own kafka topics. Where my slower topics have 20 partitions / consumers these new ones only have 5/5 and seem to be performing much better.

Is this something I could solve by (properly) using pipelines?

Looking at the pipeline ui in Kibana.. a few facts surface:

The UI is only letting me see one (of 5? more?) filter-chains
For some outputs I'm seeing > 25ms/event of latency. That's going to add up quickly. How do I fix this?
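To put the 25 ms/event figure in perspective, here is a rough back-of-envelope sketch. It assumes (and this is an assumption, not something confirmed in the thread) that the reported latency is time a pipeline worker spends blocked per event, so throughput scales linearly with the number of workers. The function names are made up for illustration.

```python
import math


def events_per_second_per_worker(ms_per_event):
    """Events one worker can push through if each event costs ms_per_event."""
    return 1000.0 / ms_per_event


def workers_needed(target_events_per_second, ms_per_event):
    """Workers required to sustain the target rate under the same assumption."""
    return math.ceil(target_events_per_second * ms_per_event / 1000.0)


# Figures taken from the thread: 25 ms/event, ~50K records/sec observed
print(events_per_second_per_worker(25))  # 40.0
print(workers_needed(50_000, 25))        # 1250
```

Under that assumption, the observed ~50K records/sec would already need on the order of a thousand busy workers across all Logstash instances, so lowering the per-event output latency matters more than adding consumers.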

Christian_Dahlqvist · August 4, 2019, 5:40pm

If you got improved throughput by changing Logstash config, it is quite likely that Elasticsearch is not the bottleneck. It may be something related to the Kafka input plugin design, but I do not know the internals.

ethrbunny · August 5, 2019, 11:29pm

I'm pretty certain now that Elasticsearch is the bottleneck.

Seeing an increasing number of error messages like this:

[logstash.outputs.elasticsearch] retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of processing of [7142251][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[filebeat-2019.08.05][0]] containing [124] requests, target allocation id: aSctrLvyTpSZkzPqVIvE0A, primary term: 2 on EsThreadPoolExecutor[name = elasticsearch-data-1/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@8085181[Running, pool size = 1, active threads = 1, queued tasks = 200, completed tasks = 5601415]]"})
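The 429 above says the write queue on that data node was full (200 queued tasks), so the bulk shard request was rejected and Logstash retried it. One way to see which nodes are rejecting is the `_cat/thread_pool/write` API. A minimal sketch, assuming the rows were fetched with `format=json` and the columns `node_name,active,queue,rejected`; the function name and the sample rows are illustrative and not real output from this cluster:

```python
def nodes_with_rejections(rows):
    """Return (node_name, rejected, queue) for nodes that have rejected writes.

    `rows` is the parsed JSON from a request such as
    GET _cat/thread_pool/write?format=json&h=node_name,active,queue,rejected
    The cat APIs typically return values as strings, so convert before comparing.
    """
    flagged = []
    for row in rows:
        rejected = int(row["rejected"])
        if rejected > 0:
            flagged.append((row["node_name"], rejected, int(row["queue"])))
    # Nodes with the most rejections first
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical sample rows for illustration only
sample_rows = [
    {"node_name": "elasticsearch-data-0", "active": "1", "queue": "12", "rejected": "0"},
    {"node_name": "elasticsearch-data-1", "active": "1", "queue": "200", "rejected": "5310"},
    {"node_name": "elasticsearch-data-2", "active": "1", "queue": "198", "rejected": "77"},
]

for name, rejected, queue in nodes_with_rejections(sample_rows):
    print(f"{name}: rejected={rejected} queue={queue}")
```

A `rejected` count that keeps growing on the data nodes while Logstash logs 429s would point at Elasticsearch rather than the Kafka input.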

ethrbunny · August 6, 2019, 12:19pm

Looking at thread pools on these data nodes I found this entry:

    "write" : {
      "type" : "fixed",
      "size" : 1,
      "queue_size" : 200
    },

Is this correct? I checked the available cores on the pod - 56.

Found some info here: https://www.elastic.co/blog/why-am-i-seeing-bulk-rejections-in-my-elasticsearch-cluster

Not an answer - but reasons not to monkey with thread pool size.
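For context: in 7.x the `write` pool is a fixed pool whose default size follows the number of processors Elasticsearch believes it has. A size of 1 on a 56-core host therefore suggests the node detected, or was configured with, a single processor, possibly because of a CPU limit on the pod. That is a guess, not something confirmed in this thread. A sketch for comparing the two values per node, assuming the input is the parsed response of `GET _nodes/os,thread_pool`; the function name and the sample payload are hypothetical:

```python
def write_pool_summary(nodes_info):
    """Summarise processor counts and write pool size for each node.

    `nodes_info` is the parsed JSON body of GET _nodes/os,thread_pool.
    Missing fields are reported as None rather than raising.
    """
    summary = []
    for node in nodes_info.get("nodes", {}).values():
        os_info = node.get("os", {})
        write_pool = node.get("thread_pool", {}).get("write", {})
        summary.append({
            "name": node.get("name"),
            "available_processors": os_info.get("available_processors"),
            "allocated_processors": os_info.get("allocated_processors"),
            "write_size": write_pool.get("size"),
        })
    return summary


# Hypothetical payload shaped like the node info response, for illustration only
sample_info = {
    "nodes": {
        "abc123": {
            "name": "elasticsearch-data-1",
            "os": {"available_processors": 1, "allocated_processors": 1},
            "thread_pool": {"write": {"type": "fixed", "size": 1, "queue_size": 200}},
        }
    }
}

for entry in write_pool_summary(sample_info):
    print(entry)
```

If `allocated_processors` comes back as 1 on a pod that can see 56 cores, the processor count is the thing to investigate, which fits the blog post's advice against changing the thread pool size directly.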

system · September 3, 2019, 12:19pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
