Filebeat Kafka output throughput is too slow

We are ingesting Zscaler logs and have hit a bottleneck, most likely in the Filebeat Kafka output.

Deployment A:

[[ Zscaler NSS ]] -- syslog input --> [[ filebeat ]] -- kafka output --> [[ Event Hub ]] ... [[ Elasticsearch ]]

This proved not to be fast enough: events reaching Elasticsearch fell behind, and the lag grew by 9 hours. As a quick fix we added Logstash to the pipeline, which significantly increased throughput.

Deployment B:

[[ Zscaler NSS ]] -- syslog input --> [[ filebeat ]] -- beat output --> [[ logstash ]] -- kafka output --> [[ Event Hub ]] ... [[ Elasticsearch ]]

This graph clearly shows how much faster the Java Kafka client (Logstash) was than the Go Kafka client (Filebeat).

[image: MicrosoftTeams-image]

Has anyone experienced something similar? What else can we try to tune the Filebeat output? We tried increasing the number of workers, and we are thinking of increasing bulk_max_size. Any other suggestions? Thanks, we appreciate the feedback.

We have tweaked the Filebeat config, and this has made it significantly faster. We chose the round-robin partitioner and set it to send 16k events to a partition before moving on to the next one. By default, it sends only 1 event to a partition before moving to the next.

   partition.round_robin:
     group_events: 16384
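
For context, here is a minimal sketch of where that setting sits in filebeat.yml. The hosts and topic values are placeholders I made up for illustration, not our real ones, and any authentication settings your endpoint needs are left out.

```yaml
# Sketch only: hosts and topic are hypothetical placeholders.
output.kafka:
  hosts: ["kafka-broker.example.com:9093"]   # placeholder broker / Event Hub endpoint
  topic: "zscaler-logs"                       # placeholder topic name
  partition.round_robin:
    # Number of events published to the same partition before the
    # partitioner picks a new one. The default is 1.
    group_events: 16384
```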

We now get throughput similar to Logstash (this screenshot is from our test):

[image: Screen Shot 2020-10-16 at 10.23.50 AM]

My question is: why does Filebeat default to this? I also tried worker, bulk_max_size, and channel_buffer_size, none of which changed the throughput. Does anyone have other suggestions, or can someone help me understand this better?
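
For reference, these are the three settings I mean, as a minimal sketch. The values shown are what I understand to be the documented defaults, not the values we tested with, so treat them as assumptions and check the docs for your Filebeat version.

```yaml
# Sketch only: values are assumed defaults, not our tested values.
output.kafka:
  worker: 1                  # number of concurrent Kafka output workers
  bulk_max_size: 2048        # max events batched into a single Kafka request
  channel_buffer_size: 256   # messages buffered per Kafka broker in the output pipeline
```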