Filebeat -> Kafka: how best to define the list of hosts (IPs vs DNS vs Zookeeper)

After a quick look through the open issues, it doesn't look like there is any plan to have Filebeat talk directly to Zookeeper instead of listing all the IPs of the cluster (since Zookeeper is the source of truth for host IPs). I could very well have overlooked this ...

Is the current expectation to list all the IPs in one's cluster in the hosts attribute?

Can I assume Filebeat has the ability to rotate through the IPs until it finds one that will provide the proper metadata?

If none of the IPs can produce the proper cluster metadata, does Filebeat error out? If so, I could say that if the Filebeat process is running, then it was able to push data to Kafka using one of the IPs in the list.

I can envision a world where DNS would be great here ... (or the ability to provide a Zookeeper path).
My concern is that the first IP provided by DNS could be "bad". Maybe this isn't a concern, since Filebeat ignores max_retries ... as long as DNS eventually gives me a good IP and the cluster is healthy.

The Kafka hosts configured in the kafka output are used for bootstrapping. These initial brokers are used to read the cluster metadata (containing brokers + topic/partition assignments). The actual connections are then made based on this metadata.

Filebeat will retry connecting to the cluster if Kafka is not available/reachable.
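
To make the bootstrapping idea concrete, here is a small, library-free simulation. It is only a sketch of the concept, not Filebeat's or Kafka's actual code: `FAKE_CLUSTER`, `fetch_metadata` and `bootstrap` are hypothetical names, and the broker addresses are made up.

```python
from typing import Dict, List, Optional

# Hypothetical cluster state: address -> metadata that broker would return,
# or None if the broker is unreachable. This is not real Kafka traffic.
METADATA = {"brokers": ["kafka1:9092", "kafka2:9092", "kafka3:9092"]}
FAKE_CLUSTER: Dict[str, Optional[dict]] = {
    "kafka1:9092": None,      # this broker is down
    "kafka2:9092": METADATA,
    "kafka3:9092": METADATA,
}


def fetch_metadata(host: str) -> dict:
    """Stand-in for a metadata request sent to a single broker."""
    meta = FAKE_CLUSTER.get(host)
    if meta is None:
        raise ConnectionError(f"cannot reach {host}")
    return meta


def bootstrap(hosts: List[str]) -> List[str]:
    """Try each configured host until one returns the cluster metadata."""
    for host in hosts:
        try:
            # The full broker list comes from the metadata, not from `hosts`.
            return fetch_metadata(host)["brokers"]
        except ConnectionError:
            continue  # move on to the next bootstrap host
    raise ConnectionError("no bootstrap broker reachable")
```

In this sketch, `bootstrap(["kafka1:9092", "kafka2:9092"])` still discovers all three brokers even though the first host is down, which is why the configured list only needs to contain a few reachable brokers rather than the whole cluster.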

My understanding is that Zookeeper is the source of truth when interacting with a Kafka cluster (i.e. the number of healthy brokers and their IPs).

If one is not going to use Zookeeper, is there some guarantee Filebeat can promise? Or is the expectation that I should manage the list of IPs in my cluster and provide them to Filebeat? In that case, if my cluster changes, I need a way to update this list dynamically.

I am trying to understand which risk(s) I might be exposing myself to by not talking to Kafka via Zookeeper, or simply get a better understanding of how Filebeat reduces the risk when not talking to Zookeeper. Maybe there have been some recent changes to Kafka that make talking to Zookeeper less critical.

Ideally, I don't want to manage all the IPs in my cluster. :]

Thanks for your time in advance,

JG

Following up to see about the above question. cc @steffens :smile:

When using Kafka, producers don't talk to Zookeeper; they query metadata directly from Kafka. This also simplifies the logic in producers a lot, as the metadata API contains all the information without clients having to rummage through Zookeeper every now and then. See the Kafka docs about the producer.

Partitions in Kafka are replicated, and the leader of a partition is re-elected now and then. If the leader is no longer available (or it returns an error saying it is no longer the leader), the Kafka producer has to ask for the new leader by querying a broker (chosen at random) for the current metadata, which should be in sync with the metadata in Zookeeper. See the Kafka docs on replication and leader election, and the protocol guide for the consequences for clients. Given that Kafka still uses Zookeeper and the metadata should be in sync between all brokers (either via consensus or by relaying metadata to Zookeeper), I don't see any major problems here.

This is pretty similar to the Java API provided by the Kafka project itself. If there are any downsides (besides the bootstrap brokers being down) to querying metadata via a broker instead of Zookeeper, the Kafka devs should know best.
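
The leader-change handling described above can be sketched roughly as follows. This is a toy model under stated assumptions, not the real client logic: `FakeCluster`, `NotLeaderError` and `send` are hypothetical names, and a real producer also deals with timeouts, backoff and per-partition state.

```python
import random


class NotLeaderError(Exception):
    """Raised by a broker that is not (or no longer) the partition leader."""


class FakeCluster:
    """Hypothetical single-partition cluster used only for illustration."""

    def __init__(self, brokers, leader):
        self.brokers = list(brokers)
        self.leader = leader

    def metadata(self, broker):
        # Assumption: any live broker can answer a metadata request.
        return {"leader": self.leader}

    def produce(self, broker, message):
        if broker != self.leader:
            raise NotLeaderError(broker)
        return f"{broker} accepted {message!r}"


def send(cluster, cached_leader, message, max_attempts=3, rng=random):
    """Send to the cached leader; on a leader error, refresh and retry."""
    leader = cached_leader
    for _ in range(max_attempts):
        try:
            return leader, cluster.produce(leader, message)
        except NotLeaderError:
            # Ask a randomly chosen broker for the current leader.
            leader = cluster.metadata(rng.choice(cluster.brokers))["leader"]
    raise RuntimeError("could not find the partition leader")
```

For example, with `cluster = FakeCluster(["b1", "b2", "b3"], leader="b2")`, calling `send(cluster, "b1", "hello")` first fails against the stale leader `b1`, refreshes the metadata from some broker, and then succeeds against `b2`, all without touching Zookeeper.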

The metadata (queried via a broker) contains the addresses as advertised by the individual brokers. That is, one still has to be careful not to advertise the wrong host name (e.g. localhost).
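
As a rough illustration of the advertised-address pitfall, the following sketch flags broker addresses that point at the local machine. It assumes addresses are given as `host:port` strings (IPv6 literals in brackets); `is_loopback` and `suspicious_brokers` are hypothetical helper names, not part of Filebeat or Kafka.

```python
import ipaddress
from typing import List


def is_loopback(address: str) -> bool:
    """Return True if a 'host:port' address refers to the local machine."""
    # Drop the port, then any brackets around an IPv6 literal.
    host = address.rsplit(":", 1)[0].strip("[]")
    if host.lower() == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # an ordinary host name, not an IP literal


def suspicious_brokers(advertised: List[str]) -> List[str]:
    """List the advertised addresses that remote clients could not reach."""
    return [addr for addr in advertised if is_loopback(addr)]
```

A broker that advertises `localhost:9092` would be reported here: a client on another machine that receives this address in the metadata would try to connect to itself rather than to the broker.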

For consumers, offsets can be managed either via Zookeeper or via Kafka: http://kafka.apache.org/documentation.html#impl_offsettracking . Only when using consumer groups should coordination be done via Zookeeper (see docs). Consumers are even discouraged from tracking offsets via Zookeeper, as this is considered deprecated. But since Beats is a producer only, we don't need to take care of this.

To me it seems like Kafka is going to discourage clients from using Zookeeper in favour of talking to Kafka directly (maybe more so in upcoming releases).

Still, one wants to configure multiple Kafka brokers for bootstrapping the connection (the partition-finding process). If all the bootstrap brokers are down, clients won't be able to connect to the Kafka cluster.
