- If consumer lag on a topic is zero for all consumer groups, will metricbeat still produce the consumergroup metrics?
The consumergroup and partition metrics are not about lag; they are always produced. These metrics do contain the offsets: the partition metricset contains the last event offset added to a topic's partition and the first offset still available from kafka (retention might have deleted old data). The consumergroup metricset reports the last offset committed by consumers, which normally operate on batches and send the last ACKed offset to Kafka/Zookeeper every so often.
You still need some math in Kibana (or when querying from ES):

    partition event count = partition end offset - partition start offset
    topic event count = sum over all partitions(partition event count)

The topic event count gives you the total number of events available in kafka. Note that kafka operates more like an append-only log, which can be used as a queue by keeping reader offsets.
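The event-count math above can be sketched in a few lines. The offsets below are invented sample data for illustration, not real metricbeat output or its exact document schema:

```python
# Sketch of the event-count math. Keys and offsets are made up;
# in practice they come from the kafka partition metricset documents.
partitions = {
    # (topic, partition) -> (start_offset, end_offset)
    ("test1", 0): (100, 1500),
    ("test1", 1): (200, 1800),
}

# partition event count = partition end offset - partition start offset
partition_event_count = {
    key: end - start for key, (start, end) in partitions.items()
}

# topic event count = sum over all partitions(partition event count)
topic_event_count = sum(partition_event_count.values())
print(topic_event_count)  # 1400 + 1600 = 3000
```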
To get a consumer's lag, you have to define some unique key to join/correlate the partition and consumer group metrics on.
- The unique key for the partition metricset is: topic, partition
- The unique key for the consumergroup metricset is: topic, partition, consumer group
Based on these unique keys, a single partition's lag can be computed as:

    consumer partition lag = partition end offset - consumer group committed offset

Note: if a consumer group is down for a long time, one can also compare the consumer group's committed offset with the partition metricset's start offset to figure out whether some events have been lost.
Once you have the consumer partition lag, you can compute:

    total consumer partition lag = sum(consumer partition lag)

Also interesting are max consumer partition lag = max(consumer partition lag), or building a histogram over the consumer partition lag.

All but the max consumer partition lag and the histogram should be buildable with timelion or the new visualization builder in Kibana. Computing cluster-wide lag metrics in metricbeat is the reason we want to add a cluster-mode to metricbeat. Until then, it can maybe be solved with the aggregate filter in Logstash.
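Joining the two metricsets on their unique keys and deriving the lag metrics can be sketched like this. The offsets and group names below are invented sample data; in Kibana/ES you would compute the equivalent with the visualization tools mentioned above:

```python
# Hedged sketch: correlate partition and consumergroup metrics on their
# unique keys and derive per-partition, total, and max consumer lag.
# All values below are made-up sample data, not real metricbeat output.
partition_end = {
    # (topic, partition) -> partition end offset
    ("test1", 0): 1500,
    ("test1", 1): 1800,
}
consumer_committed = {
    # (topic, partition, consumer group) -> last committed offset
    ("test1", 0, "lsc1"): 1400,
    ("test1", 1, "lsc1"): 1500,
}

# consumer partition lag = partition end offset - committed offset,
# joined on the shared (topic, partition) part of the key
lags = {
    (topic, part, group): partition_end[(topic, part)] - committed
    for (topic, part, group), committed in consumer_committed.items()
}

total_lag = sum(lags.values())  # total consumer partition lag
max_lag = max(lags.values())    # max consumer partition lag
print(total_lag, max_lag)       # 400 300
```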
- Is it possible that metricbeat is seeing consumer groups ([BrokerTestProchain3 BrokerTestProchain12 BrokerTestProchain8]) but not reporting any events for them because the leaders are on other nodes? But why is metricbeat picking up those 3 consumer groups in the first place?
The Kafka API only allows collecting metrics/info from the kafka node that is the leader. Trying to get information from node-1 for partitions/consumer groups handled by node-2 will get you an error. Partitions and consumer groups can be handled by different kafka nodes at any given time; that's why you need to query all kafka nodes to build up the full state. If some query fails, an error event is published by metricbeat and possibly logged. Have you had a look at the metricbeat logs and/or the index for errors?
- Should we configure the metricbeat on node1 to connect to all 5 nodes?

    hosts: ["our-kafka-node1-dn-name:9093", "our-kafka-node2-dn-name:9093", ....]

OR is it better to install metricbeat on all five nodes? It seems like that is what you are suggesting.
With edge-mode I'd suggest running metricbeat on every single host. The reason is that you also want to collect CPU, memory and disk usage. Disk usage especially matters with kafka retention: a rogue client can bring your ingest to a halt if the disk is full.
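For reference, an edge-mode setup per kafka node could look roughly like the following. This is a minimal sketch; metricset names, port, and periods are assumptions to adjust to your setup:

```yaml
# Sketch of a per-host metricbeat config (edge-mode): each node only
# queries itself, so run this on every kafka node.
metricbeat.modules:
  - module: kafka
    metricsets: ["partition", "consumergroup"]
    hosts: ["localhost:9093"]
    period: 10s
  - module: system
    metricsets: ["cpu", "memory", "filesystem"]
    period: 10s
```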
However, the metricbeat log shows that it is still just picking up those old consumer groups:
DBG known consumer groups: %!(EXTRA string=[BrokerTestProchain3 BrokerTestProchain12 BrokerTestProchain8])
How did you configure metricbeat for this test?
I have no idea for how long kafka keeps consumer group information, or how exactly consumer groups are stored. In kafka there is no correlation between a node being responsible for producers and one being responsible for consumer groups. In theory you can read a partition without managing a consumer group at all. You can think of consumer groups as being a topic as well: the last committed offset is produced into the consumer group topic, which also has partitions, which another kafka node may be responsible for. Do an ls on the kafka data directory and you will find the consumer groups in there. That's why you need to query all kafka nodes.
I've been testing with Logstash in the past, e.g. for the producer LS instance:
    input {
      generator {}
    }

    filter {
      sleep {
        time => "1"
        every => 1000
      }
    }

    output {
      kafka {
        client_id => "lsp"
        topic_id => "test1"
      }
    }
and the consumer:
    input {
      kafka {
        group_id => "lsc1"
        topics => "test1"
      }
      kafka {
        group_id => "lsc2"
        topics => "test1"
      }
      kafka {
        group_id => "lsc3"
        topics => "test1"
      }
      kafka {
        group_id => "lsc4"
        topics => "test1"
      }
      kafka {
        group_id => "lsc5"
        topics => "test1"
      }
    }

    output {
      null {}
    }
When running these tests, have metricbeat run on every kafka node, or add every single kafka node to the metricbeat configuration.