Kafka ingest performance issues

I'm sending all my Beats (filebeat/metricbeat/etc.) data into Kafka and reading it from there via Logstash (running on Kubernetes). I use Kafka for lots of other purposes, and I know we can read 1M records a second using a simple consumer. When I look at the stats for my cluster, it appears we're getting ~50K records/sec.

I suspect I have something misconfigured in my Logstash setup. I'm not sure where to look for tips/tricks on optimizing this path.
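For reference, the throughput-related knobs on this path that I know of live in the pipeline config. A minimal sketch (broker address, topic names, group id, and thread count below are placeholders, not my actual config):

```
input {
  kafka {
    bootstrap_servers => "kafka:9092"
    topics            => ["filebeat", "metricbeat"]
    group_id          => "logstash"
    # one consumer thread per partition owned by this Logstash instance
    consumer_threads  => 4
    codec             => "json"
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
  }
}
```

Beyond the input, `pipeline.workers` and `pipeline.batch.size` in logstash.yml also affect how fast events drain toward Elasticsearch.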

System: all 7.3.0
5 data nodes running on kubernetes hosts with 56 cores/64G RAM
5 ingest nodes
3 master nodes

The filebeat/metricbeat Kafka topics have 20 partitions and 20 consumers each - also running in Kubernetes.

Have you verified that Elasticsearch is not the bottleneck? What type of storage do your data nodes have?

Data node storage is from a SAN over a 10G network.

How would I (dis)prove that Elasticsearch is the limiting factor?
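One quick check: watch the write thread pool's rejected counter on the data nodes while ingesting. If it climbs steadily, Elasticsearch can't keep up with the bulk requests Logstash is sending. Assuming the cat API is reachable, something like:

```
GET _cat/thread_pool/write?v&h=node_name,name,active,queue,rejected,completed
```

A persistently full `queue` plus a growing `rejected` count points at Elasticsearch; idle pools with low throughput point back at Logstash/Kafka.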

Additional: yesterday I split some of the higher-traffic items from filebeat/metricbeat into their own Kafka topics. Where my slower topics have 20 partitions/consumers, these new ones have only 5/5 and seem to be performing much better.

Is this something I could solve by (properly) using pipelines?

Looking at the pipeline UI in Kibana, a few facts surface:

  1. The UI only lets me see one (of 5? more?) filter chains.
  2. For some outputs I'm seeing > 25ms/event of latency. That's going to add up quickly. How do I fix this?

If you got improved throughput by changing the Logstash config, it is quite likely that Elasticsearch is not the bottleneck. It may be something related to the Kafka input plugin design, but I do not know the internals.

I'm pretty certain now that Elasticsearch is the bottleneck.

I'm seeing an increasing number of error messages like this:

[logstash.outputs.elasticsearch] retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of processing of [7142251][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[filebeat-2019.08.05][0]] containing [124] requests, target allocation id: aSctrLvyTpSZkzPqVIvE0A, primary term: 2 on EsThreadPoolExecutor[name = elasticsearch-data-1/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@8085181[Running, pool size = 1, active threads = 1, queued tasks = 200, completed tasks = 5601415]]"})

Looking at thread pools on these data nodes I found this entry:

    "write" : {
      "type" : "fixed",
      "size" : 1,
      "queue_size" : 200

Is this correct? I checked the available cores on the pod - 56.
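The write pool's default size is derived from the number of processors Elasticsearch detects, so a pool size of 1 suggests the node only sees one core - possibly a container CPU limit on the pod, or an explicit `processors` setting in elasticsearch.yml. What the node actually detected can be compared against the pod's 56 cores with:

```
GET _nodes/os?filter_path=nodes.*.name,nodes.*.os.available_processors,nodes.*.os.allocated_processors
```

If `allocated_processors` comes back as 1, fixing the detected processor count (rather than resizing the thread pool itself) would be the place to start.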

Found some info here: https://www.elastic.co/blog/why-am-i-seeing-bulk-rejections-in-my-elasticsearch-cluster

Not an answer - but it gives reasons not to monkey with the thread pool size.