Metricbeat kafka module - consumergroup metricset does not report metrics

We are using Metricbeat with the kafka module turned on. The kafka module is configured to send both the partition and consumergroup metricsets
to Elasticsearch. We see that Metricbeat is sending the partition metricset, but it does not seem to be sending any consumergroup metrics to Elasticsearch.
We are using the latest version of Metricbeat (5.3.1).

Looking at the debug logs in Metricbeat, we see the following log statement several times:

known consumer groups: %!(EXTRA []string=[BrokerTestProchain12 BrokerTestProchain8 BrokerTestProchain3])

There are two problems with the above log statement:

  1. When we run the kafka-consumer-groups.sh --list command on the kafka node, we see more than those three consumer groups on the broker. There are about 16 consumer groups on that broker, but they never show up in that log statement! It is always just those three consumer groups!
  2. If Metricbeat is finding those 3 consumer groups, then should it not at least find events for those consumer groups too? We see the following message in the logs:
2017-04-21T19:06:25Z INFO Non-zero metrics in the last 30s: fetches.kafka-consumergroup.success=3 fetches.kafka-partition.events=192 fetches.kafka-partition.success=3 fetches.system-cpu.events=3 fetches.system-cpu.success=3 fetches.system-filesystem.events=24 fetches.system-filesystem.success=3 fetches.system-fsstat.events=3 fetches.system-fsstat.success=3 fetches.system-load.events=3 fetches.system-load.success=3 fetches.system-memory.events=3 fetches.system-memory.success=3 fetches.system-network.events=6 fetches.system-network.success=3 fetches.system-process.events=396 fetches.system-process.success=3 libbeat.es.call_count.PublishEvents=15 libbeat.es.publish.read_bytes=10244 libbeat.es.publish.write_bytes=396098 libbeat.es.published_and_acked_events=630 libbeat.publisher.messages_in_worker_queues=630 libbeat.publisher.published_events=630

As you can see in the above message, we have fetches.kafka-consumergroup.success=3 but no events for it.

Can someone please help? This stuff was looking really promising for us until we hit this snag. It's powerful stuff and we would love to get past it. I've been banging my head against it for some time now. What could we be missing?

This sounds to me like some consumers are using the old client configuration, coordinating via ZooKeeper rather than via Kafka. The new client configuration coordinates consumer groups via the Kafka API.

Officially there are 2 different ways for consumers to store state. State storage is required for coordination between active clients and for restarts of clients (remembering the last offset processed). The state stored includes assignments of consumer groups to topics and the last read/committed offsets. (1) The old (and mostly deprecated) way of storing state is having the consumers manage all state and assignments themselves via ZooKeeper. (2) The new and recommended way (introduced in kafka 0.9) is to use the kafka consumergroup management capabilities (which must be explicitly used/configured by the client) to handle all state.
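
For illustration, the difference shows up in the consumer configuration itself. A sketch using the Java client property names (addresses are placeholders):

```
# Strategy (2): Kafka-managed group coordination (0.9+ Java consumer)
bootstrap.servers=broker1:9092
group.id=my-consumer-group

# Strategy (1): the old ZooKeeper-managed consumers were configured
# against ZooKeeper instead:
# zookeeper.connect=zk1:2181
# group.id=my-consumer-group
```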

Kafka and its tools (e.g. bin/kafka-consumer-groups.sh) explicitly support both types (also for compatibility between client libraries). This becomes apparent from the fact that one can pass the zookeeper address to the kafka tools as well.
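
For example, the same tool can list either kind of group depending on which address it is pointed at (addresses are placeholders):

```
# New-style groups, coordinated via the Kafka API:
bin/kafka-consumer-groups.sh --bootstrap-server broker1:9092 --list

# Old-style groups, with state kept in ZooKeeper:
bin/kafka-consumer-groups.sh --zookeeper zk1:2181 --list
```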

The kafka module in metricbeat DOES NOT access zookeeper for any information. That is, it fully relies on the kafka APIs for all the information it collects. This means the kafka module only supports consumer groups using strategy (2), but not those still using strategy (1).

I have tested with Logstash 2.4 (using the old client configuration) and Logstash 5.x (using the new client configuration). The consumergroup metricset works only with LS 5.x here.

One of the advantages of using the "new" Kafka-based API is that clients don't need to talk to ZooKeeper anymore. This gives you the option to hide ZooKeeper in your network, as it is only required by kafka itself.

Steffen,
Thanks for the reply. I am 100% positive that we are using the "new client configuration". We are using kafka 0.10 and spring-integration Java clients. These spring-integration clients do it the way you have mentioned: they store the offsets on the kafka nodes, not in ZooKeeper.
Also, the kafka command I am using to check which consumer groups I have on the broker is:
bin/kafka-consumer-groups.sh --bootstrap-server <ourbroker dn address>:9093 --list
This gives me a list of 16 consumer groups.

Adding some more information that may be relevant. We are using SSL, BUT I think beats is configured correctly. We have provided the correct certs in the yaml file. Following is our beats yaml file:

###################### Metricbeat Configuration Example #######################

# This file is an example configuration file highlighting only the most common
# options. The metricbeat.full.yml file from the same directory contains all the
# supported options with more comments. You can use it as a reference.
#
# You can find the full configuration reference here:
# https://www.elastic.co/guide/en/beats/metricbeat/index.html

#==========================  Modules configuration ============================
metricbeat.modules:

#------------------------------- Kafka Module --------------------------------
- module: kafka
  metricsets:
    - partition
    - consumergroup
  enabled: true
  period: 10s
  hosts: ["our-kafka-node1-dn-name:9093"]

  client_id: kafka-broker-node1
  retries: 3
  backoff: 250ms

  # List of Topics to query metadata for. If empty, all topics will be queried.
  topics: []

  # Optional SSL. By default is off.
  # List of root certificates for HTTPS server verifications
  ssl.certificate_authorities: ["/opt/kafka/latest/ssl/qetestrootca.pem"]

  # Certificate for SSL client authentication
  ssl.certificate: "/opt/kafka/latest/ssl/kafka_clientstore.pem"

  # Client Certificate Key
  ssl.key: "/opt/kafka/latest/ssl/kafka_clientstore.key"


#------------------------------- System Module -------------------------------
- module: system
  metricsets:
    # CPU stats
    - cpu

    # System Load stats
    - load

    # Per CPU core stats
    #- core

    # IO stats
    #- diskio

    # Per filesystem stats
    - filesystem

    # File system summary stats
    - fsstat

    # Memory stats
    - memory

    # Network stats
    - network

    # Per process stats
    - process

    # Sockets (linux only)
    #- socket
  enabled: true
  period: 10s
  processes: ['.*']



#================================ General =====================================

# The name of the shipper that publishes the network data. It can be used to group
# all the transactions sent by a single shipper in the web interface.
#name:

# The tags of the shipper are included in their own field with each
# transaction published.
#tags: ["service-X", "web-tier"]

# Optional fields that you can specify to add additional information to the
# output.
#fields:
#  env: staging

#================================ Outputs =====================================

# Configure what outputs to use when sending the data collected by the beat.
# Multiple outputs may be used.

#-------------------------- Elasticsearch output ------------------------------
output.elasticsearch:
  # Array of hosts to connect to.
  hosts: ["ouelasticcloudip.us-east-1.aws.found.io:9243"]

  # Optional SSL. By default is off.
  ssl.certificate_authorities: ["/etc/pki/tls/cert.pem"]
  ssl.certificate: "/opt/kafka/latest/ssl/kafka_clientstore.pem"
  ssl.key: "/opt/kafka/latest/ssl/kafka_clientstore.key"

  # Optional protocol and basic auth credentials.
  protocol: "https"
  username: "blah"
  password: blah

#----------------------------- Logstash output --------------------------------
#output.logstash:
  # The Logstash hosts
  #hosts: ["localhost:5044"]

  # Optional SSL. By default is off.
  # List of root certificates for HTTPS server verifications
  #ssl.certificate_authorities: ["/etc/pki/root/ca.pem"]

  # Certificate for SSL client authentication
  #ssl.certificate: "/etc/pki/client/cert.pem"

  # Client Certificate Key
  #ssl.key: "/etc/pki/client/cert.key"

#================================ Logging =====================================

# Sets log level. The default log level is info.
# Available log levels are: critical, error, warning, info, debug
logging.level: debug

# At debug level, you can selectively enable logging only for some components.
# To enable all selectors use ["*"]. Examples of other selectors are "beat",
# "publish", "service".
#logging.selectors: ["*"]

More relevant info:
Restarted Metricbeat and still seeing the following message:
2017-04-26T18:00:15Z DBG known consumer groups: %!(EXTRA []string=[BrokerTestProchain3 BrokerTestProchain12 BrokerTestProchain8])

On the kafka node I run the following command to get a list of consumer groups:
bin/kafka-consumer-groups.sh --bootstrap-server qetest-kafka-green-1.taulia.com:9093 --command-config client-ssl.properties --list
Following is the result of that command:

Note: This will only show information about consumers that use the Java consumer API (non-ZooKeeper-based consumers).

BrokerTestProchain1
BrokerTestProchain6
BrokerTestProchain10
BrokerTestProchain15
BrokerTestProchain3
BrokerTestProchain12
BrokerTestProchain8
BrokerTestProchain4
BrokerTestProchain9
BrokerTestProchain13
BrokerTestProchain5
BrokerTestProchain14
BrokerTestProchain2
BrokerTestProchain11
BrokerTestProchain7
BrokerTestProchain16

I ran the following command as a follow-up:
bin/kafka-consumer-groups.sh --bootstrap-server qetest-kafka-green-1.taulia.com:9093 --command-config client-ssl.properties --describe --group BrokerTestProchain8
The output of this command is:

TOPIC                          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG        CONSUMER-ID                                      HOST                           CLIENT-ID

I ran the same command for BrokerTestProchain3 and BrokerTestProchain12 and got the same result.
I also ran the same command for the other consumer groups and got the same result.

YET... I see the same message in the Metricbeat logs:
2017-04-26T18:00:15Z DBG known consumer groups: %!(EXTRA []string=[BrokerTestProchain3 BrokerTestProchain12 BrokerTestProchain8])

So I don't know what is going on and what's so special about those three consumer groups...

Question: how many kafka nodes do you have?

When monitoring kafka with metricbeat we have to consider:

  • Normally metricbeat is supposed to run on the edge nodes.
  • In Kafka, the broker state (for producers) and consumer group states are not global.
  • Leader nodes in kafka can change.
  • With state being local per node, one needs to collect and correlate metrics from all nodes.

Currently metricbeat operates in 'edge' mode. That is, it only collects information from the nodes configured. The kafka-consumer-groups.sh script operates in cluster mode: it uses bootstrapping to find all kafka nodes, connects to every node, collects the information and prints it. This mode has a disadvantage if one broker is not reachable from the monitoring script.

The kafka module in metricbeat tries to figure out which kafka node localhost is used for. That is, a configuration like the following can be copied to all kafka hosts to collect broker metrics (e.g. one can bind a custom listener in kafka to localhost only), consumer groups and system resources:

- module: kafka
  metricsets:
    - partition
    - consumergroup
  period: 10s
  hosts: ["localhost:9092"]

  client_id: "metricbeat"

- module: system
  metricsets:
    - cpu
    - diskio
    - fsstat
    - memory
    - process
  processes: ['java']

One can bind an unencrypted port to localhost and a public encrypted one via:

listeners=PLAINTEXT://localhost:9092,SSL://:9093
advertised.listeners=SSL://our-kafka-node1-dn-name:9093

The advantage of edge mode is that you can collect additional system resources. The disadvantage is that the indexed data needs to be correlated to answer questions like "which client has the max lag". Total lag/queue size can be computed by summing up all the stats in kibana.

Metricbeat's architecture currently doesn't allow us to easily add a cluster mode. But it's definitely on the TODO list.

Steffen,
To answer your question:
We have 5 kafka nodes. However, we have only installed Metricbeat on node-1. We wanted to do this to test it and see how it works. The plan is to install Metricbeat on all 5 nodes. But currently only node-1 has it, and as you can see, metricbeat.yml is configured to read the metrics only from node-1.

Questions:

  1. If consumer lag on a topic is zero for all consumer groups, will Metricbeat still produce the consumergroup metrics?
  2. Is it possible that Metricbeat is seeing the consumer groups ([BrokerTestProchain3 BrokerTestProchain12 BrokerTestProchain8]) but not reporting any events for them because the leaders are on other nodes? But why is Metricbeat picking up those 3 consumer groups in the first place?
  3. Should we configure the Metricbeat on node-1 to connect to all 5 nodes?
    hosts: ["our-kafka-node1-dn-name:9093", "our-kafka-node2-dn-name:9093" , ....]
    OR is it better to install Metricbeat on all five nodes? It seems like that is what you are suggesting.

Steffen,
I ran another test by producing to a topic with 10 partitions and a replication factor of 5.
I started 3 consumer groups on the same topic. I could see, using the kafka scripts, that node-1 was the leader for a couple of partitions of that topic. The names of my consumer groups are completely different now: BrokerTestProchain4Rahul, BrokerTestProchain2Rahul, BrokerTestProchain3Rahul.

However, the Metricbeat log shows that it is still just picking up those old consumer groups:
DBG known consumer groups: %!(EXTRA []string=[BrokerTestProchain3 BrokerTestProchain12 BrokerTestProchain8])

Metricbeat does not pick up my new consumer groups at all. However, it seems to pick up the partition metricset just fine. But that was always the case. I see the following in the Metricbeat log, proving that the partition events are being published correctly:

2017-04-26T21:53:41Z DBG  Publish: {
  "@timestamp": "2017-04-26T21:53:41.547Z",
  "beat": {
    "hostname": "qetest-kafka-green-1.taulia.com",
    "name": "qetest-kafka-green-1.taulia.com",
    "version": "5.3.1"
  },
  "kafka": {
    "partition": {
      "broker": {
        "address": "qetest-kafka-green-1.taulia.com:9093",
        "id": 1
      },
      "offset": {
        "newest": 2456,
        "oldest": 0
      },
      "partition": {
        "id": 6,
        "insync_replica": true,
        "leader": 1,
        "replica": 4
      },
      "topic": {
        "name": "com.taulia.broker.test.1.rahul"
      }
    }
  },
  "metricset": {
    "host": "qetest-kafka-green-1.taulia.com:9093",
    "module": "kafka",
    "name": "partition",
    "rtt": 63182
  },
  "type": "metricsets"
}

I found the following post on stackoverflow, which seems to be exactly what I am seeing:


But nobody has responded to that post.

  1. If consumer lag on a topic is zero for all consumer groups, will Metricbeat still produce the consumergroup metrics?

The consumergroup and partition metricsets are not about lag; they will always produce metrics. These metrics do contain the offsets. The partition metricset contains the last event offset added to a topic's partition and the first offset still available from kafka (retention might have deleted old data). The consumergroup metricset reports the last offset committed by consumers, which normally operate on batches and send the last ACKed offset to Kafka/Zookeeper every so often.

You still need some math in kibana (or when querying from ES):

  • partition event count = partition end offset - partition start offset
  • topic event count = sum all(partition event count).

The topic event count gives you the total number of events available in kafka. Note that kafka operates more like an append-only log, which can be used as a queue by keeping reader offsets.
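
As a tiny illustration of that arithmetic (offset values made up, field layout not the actual metricbeat schema):

```
# partition metricset offsets per partition: (oldest, newest)
offsets = {0: (0, 2456), 1: (0, 2503)}

# partition event count = partition end offset - partition start offset
event_count = {p: newest - oldest for p, (oldest, newest) in offsets.items()}

# topic event count = sum of all partition event counts
topic_event_count = sum(event_count.values())  # 4959
```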

To get a consumer's lag you have to define a unique key to join/correlate the partition and consumer group metrics on:

  • The unique key for the partition metricset is: topic, partition
  • The unique key for the consumergroup metricset is: topic, partition, consumer group

Based on these unique keys, a single partition's lag can be computed by:

consumer partition lag = partition end offset - consumer group partition offset

Note: if a consumer group is down for a long time, one can also compare the consumer group's partition offset with the partition metricset's start offset, to figure out if some events have been lost.

Once you have the consumer partition lag you can compute:

total consumer partition lag = sum(consumer partition lag).

Also very interesting is max consumer partition lag = max(consumer partition lag), or building a histogram over consumer partition lag.
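
To make the joins concrete, here is a minimal Python sketch of this math with made-up offsets; the dict layout is illustrative, not the actual metricbeat document schema:

```
# partition metricset, keyed by (topic, partition) -> newest offset
partition_end = {
    ("test1", 0): 2456,
    ("test1", 1): 2503,
}

# consumergroup metricset, keyed by (topic, partition, group) -> committed offset
group_offset = {
    ("test1", 0, "lsc1"): 2400,
    ("test1", 1, "lsc1"): 2503,
}

# consumer partition lag = partition end offset - consumer group partition offset
lag = {key: partition_end[key[:2]] - offset for key, offset in group_offset.items()}

total_lag = sum(lag.values())  # total consumer partition lag = 56
max_lag = max(lag.values())    # max consumer partition lag = 56
```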

All but max consumer partition lag and a histogram should be buildable with timelion or the new visualization builder in Kibana. Computing cluster-wide lag metrics in metricbeat is the reason we want to add a cluster mode to metricbeat. Until then, maybe it can be solved with the aggregate filter in Logstash.

  2. Is it possible that Metricbeat is seeing the consumer groups ([BrokerTestProchain3 BrokerTestProchain12 BrokerTestProchain8]) but not reporting any events for them because the leaders are on other nodes? But why is Metricbeat picking up those 3 consumer groups in the first place?

The Kafka API only allows collecting metrics/info from the kafka node that is the leader. Trying to get information from node-1 for partitions/consumer groups handled by node-2 will get you an error. Partitions and consumer groups can be handled by different kafka nodes at any given time. That's why you need to query all kafka nodes to build up the full state. If some query fails, an error event is published by metricbeat and possibly also logged. Have you had a look at the metricbeat logs and/or the index for errors?

  3. Should we configure the Metricbeat on node-1 to connect to all 5 nodes?
    hosts: ["our-kafka-node1-dn-name:9093", "our-kafka-node2-dn-name:9093" , ....]
    OR is it better to install Metricbeat on all five nodes? It seems like that is what you are suggesting.

With edge mode I'd suggest running metricbeat on every single host. The reason is that you also want to collect CPU, memory and disk usage. Disk usage matters especially with kafka retention: a rogue client plus a full disk can bring your ingest to a halt.

However, the Metricbeat log shows that it is still just picking up those old consumer groups:
DBG known consumer groups: %!(EXTRA []string=[BrokerTestProchain3 BrokerTestProchain12 BrokerTestProchain8])

How did you configure metricbeat for this test?

I have no idea how long kafka keeps consumer group information, or how exactly consumer groups are stored. In kafka there is no correlation between a node being responsible for producers and one being responsible for consumer groups. In theory you can read a partition without managing a consumer group at all. You can think of consumer groups as being a topic as well: the last committed offset is produced into the consumer group topic, which also has partitions, for which another kafka node may be responsible. Do an ls on the kafka data directory and you will find the consumer group topic in there. That's why you need to query all kafka nodes.
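
One way to check where that internal topic lives (0.10-era tooling, placeholder ZooKeeper address) is to describe it and look at the Leader column for each partition:

```
bin/kafka-topics.sh --zookeeper zk1:2181 --describe --topic __consumer_offsets
```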

I've been testing with logstash in the past. E.g. for the producer LS instance:

input {
  generator {}
}

filter {
  sleep {
    time => "1"
    every => 1000
  }
}

output {
  kafka {
    client_id => "lsp"
    topic_id => "test1"
  }
}

and consumer:

input {
  kafka {
    group_id => "lsc1"
    topics => "test1"
  }
  kafka {
    group_id => "lsc2"
    topics => "test1"
  }
  kafka {
    group_id => "lsc3"
    topics => "test1"
  }
  kafka {
    group_id => "lsc4"
    topics => "test1"
  }
  kafka {
    group_id => "lsc5"
    topics => "test1"
  }
}

output {
  null {}
}

When running these tests, have metricbeat run on every kafka node, or add every single kafka node to metricbeat's hosts setting.
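
The latter would look roughly like this (hostnames are placeholders):

```
- module: kafka
  metricsets:
    - partition
    - consumergroup
  period: 10s
  hosts: ["kafka1:9092", "kafka2:9092", "kafka3:9092", "kafka4:9092", "kafka5:9092"]
```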

Steffen,
First off... thank you so much for your detailed response, especially the part about the kibana math. Once we
get past the hurdle of collecting the data we will start to look at that aspect. It's good to have that info already!

You asked if we see any errors being logged by Metricbeat. The only thing we see in the Metricbeat logs
is the following 'Debug' statement:
DBG broker is not leader (broker=1, leader=2)
But I looked at the beats code and that line comes from the partition metricset collector... so that's not a problem, I guess. We don't see any error events being published to our elasticsearch cluster.

You made the following statement about consumer groups:
You can think of consumer groups as being a topic as well: the last committed offset is produced into the consumer group topic, which also has partitions, for which another kafka node may be responsible.
Based on that statement, it could be that even though node-1 was the leader for some topic partitions, it
was not the node managing the consumer groups, and hence we are not seeing it
collect any consumergroup data.

So... as a follow-up, we will install Metricbeat, with the kafka module configured, on all 5 nodes, rerun our test scenario, and see if we pick up consumergroup data.

Btw, thanks for providing that simple logstash producer/consumer example. Quite useful for testing!

Thanks again for your response. Will keep you posted!

Right, this is what I'm assuming is happening.

For doing some math at ingest time, there is a new logstash filter in the making which I hope can be used/scripted to correlate and precompute some consumer lag stats for kafka.

Steffen
So we finally were able to get Metricbeat installed on all five kafka nodes. We ran the same test and we still don't see data from the consumergroup metricset being reported. How can I debug this? We have turned on debug-level logging in Metricbeat. We don't see any error events being reported.

I am running the following command to see what consumer groups are being reported on node4:
bin/kafka-consumer-groups.sh --bootstrap-server qetest-kafka-green-4.taulia.com:9093 --command-config /home/rahul.joshi/client-ssl.properties --list
This gives me:

BrokerTestProchain3
BrokerTestProchain12
BrokerTestProchain8
BrokerTestProchain5
BrokerTestProchain3Rahul
BrokerTestProchain14
BrokerTestProchain1
BrokerTestProchain6
BrokerTestProchain10
BrokerTestProchain15
BrokerTestProchain4Rahul
BrokerTestProchain2
BrokerTestProchain11
BrokerTestProchain7
BrokerTestProchain16
BrokerTestProchain4
BrokerTestProchain9
BrokerTestProchain13
BrokerTestProchain2Rahul

I also ran a command on kafka broker node 4 to make sure that it is a leader for some of the partitions of the __consumer_offsets topic, which it is. So according to your theory I should see consumergroup metrics.

Is there any other way I should look for consumer groups on the broker? You had mentioned looking at the kafka data directory. The logs directory shows the following:

com.taulia.broker.test.1.rahul-7  com.taulia.test.config-9  __consumer_offsets-2   __consumer_offsets-33  __consumer_offsets-44  recovery-point-offset-checkpoint
com.taulia.broker.test.1.rahul-0  com.taulia.broker.test.1.rahul-8  __consumer_offsets-1      __consumer_offsets-21  __consumer_offsets-34  __consumer_offsets-45  replication-offset-checkpoint
com.taulia.broker.test.1.rahul-1  com.taulia.broker.test.1.rahul-9  __consumer_offsets-10     __consumer_offsets-22  __consumer_offsets-37  __consumer_offsets-46
com.taulia.broker.test.1.rahul-2  com.taulia.test.config-0          __consumer_offsets-13     __consumer_offsets-24  __consumer_offsets-38  __consumer_offsets-49
com.taulia.broker.test.1.rahul-3  com.taulia.test.config-1          __consumer_offsets-14     __consumer_offsets-25  __consumer_offsets-39  __consumer_offsets-5
com.taulia.broker.test.1.rahul-4  com.taulia.test.config-4          __consumer_offsets-17     __consumer_offsets-26  __consumer_offsets-4   __consumer_offsets-6
com.taulia.broker.test.1.rahul-5  com.taulia.test.config-5          __consumer_offsets-18     __consumer_offsets-29  __consumer_offsets-41  __consumer_offsets-9
com.taulia.broker.test.1.rahul-6  com.taulia.test.config-8          __consumer_offsets-19     __consumer_offsets-30  __consumer_offsets-42  meta.properties

I also ran the following command on node4:
bin/kafka-consumer-groups.sh --bootstrap-server qetest-kafka-green-4.taulia.com:9093 --command-config /home/rahul.joshi/client-ssl.properties --describe --group BrokerTestProchain3Rahul

got the following:

TOPIC                          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG        CONSUMER-ID                                      HOST                           CLIENT-ID
com.taulia.broker.test.1.rahul 0          2425            2425            0          BrokerTestProchain3_com.taulia.broker.test.1.rahul_BrokerTestProchain3Rahul-1-<GUID><IP> BrokerTestProchain3_com.taulia.broker.test.1.rahul_BrokerTestProchain3Rahul-1
com.taulia.broker.test.1.rahul 7          2555            2555            0          BrokerTestProchain3_com.taulia.broker.test.1.rahul_BrokerTestProchain3Rahul-7-<GUID><IP> BrokerTestProchain3_com.taulia.broker.test.1.rahul_BrokerTestProchain3Rahul-7
com.taulia.broker.test.1.rahul 9          2569            2569            0          BrokerTestProchain3_com.taulia.broker.test.1.rahul_BrokerTestProchain3Rahul-9-<GUID><IP> BrokerTestProchain3_com.taulia.broker.test.1.rahul_BrokerTestProchain3Rahul-9
com.taulia.broker.test.1.rahul 8          2495            2495            0          BrokerTestProchain3_com.taulia.broker.test.1.rahul_BrokerTestProchain3Rahul-8-<GUID><IP> BrokerTestProchain3_com.taulia.broker.test.1.rahul_BrokerTestProchain3Rahul-8
com.taulia.broker.test.1.rahul 6          2528            2528            0          BrokerTestProchain3_com.taulia.broker.test.1.rahul_BrokerTestProchain3Rahul-6-<GUID><IP> BrokerTestProchain3_com.taulia.broker.test.1.rahul_BrokerTestProchain3Rahul-6
com.taulia.broker.test.1.rahul 5          2459            2459            0          BrokerTestProchain3_com.taulia.broker.test.1.rahul_BrokerTestProchain3Rahul-5-<GUID><IP> BrokerTestProchain3_com.taulia.broker.test.1.rahul_BrokerTestProchain3Rahul-5
com.taulia.broker.test.1.rahul 1          2486            2486            0          BrokerTestProchain3_com.taulia.broker.test.1.rahul_BrokerTestProchain3Rahul-10-<GUID><IP> BrokerTestProchain3_com.taulia.broker.test.1.rahul_BrokerTestProchain3Rahul-10
com.taulia.broker.test.1.rahul 2          2503            2503            0          BrokerTestProchain3_com.taulia.broker.test.1.rahul_BrokerTestProchain3Rahul-2-<GUID><IP> BrokerTestProchain3_com.taulia.broker.test.1.rahul_BrokerTestProchain3Rahul-2
com.taulia.broker.test.1.rahul 3          2468            2468            0          BrokerTestProchain3_com.taulia.broker.test.1.rahul_BrokerTestProchain3Rahul-3-<GUID><IP> BrokerTestProchain3_com.taulia.broker.test.1.rahul_BrokerTestProchain3Rahul-3
com.taulia.broker.test.1.rahul 4          2512            2512            0          BrokerTestProchain3_com.taulia.broker.test.1.rahul_BrokerTestProchain3Rahul-4-<GUID><IP> BrokerTestProchain3_com.taulia.broker.test.1.rahul_BrokerTestProchain3Rahul-4

More info:
I turned on debug logging for Metricbeat on each kafka node. The Metricbeat on each node seems to be picking up known consumer groups... but unfortunately, like I said, no events for the consumergroup metricset are being generated. For example:

On node 5 I see the following in the logs:

2017-05-03T00:21:14Z DBG  known consumer groups: %!(EXTRA []string=[BrokerTestProchain10 BrokerTestProchain15 BrokerTestProchain4Rahul BrokerTestProchain1 BrokerTestProchain6])

On node 4 I see the following in the logs:

2017-05-03T00:25:02Z DBG  known consumer groups: %!(EXTRA []string=[BrokerTestProchain5 BrokerTestProchain3Rahul BrokerTestProchain14])

No error events...

Note that in our Metricbeat configuration we still have each Metricbeat process configured to talk to only one kafka node.

So node 4 will have ....

- module: kafka
  metricsets:
    - partition
    - consumergroup
  enabled: true
  period: 10s
  hosts: ["qetest-kafka-green-4.taulia.com:9093"]

similarly node-1 will have ...

- module: kafka
  metricsets:
    - partition
    - consumergroup
  enabled: true
  period: 10s
  hosts: ["qetest-kafka-green-1.taulia.com:9093"]

and so on for all 5 nodes.

Is it relevant that we are using ACLs for our topics?
I don't think it should matter though, because Metricbeat has been configured to use the kafka broker cert itself, which has superuser ability in the server configuration. I believe superusers don't need ACLs in kafka.
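
For reference, the kind of broker-side setting I mean (values here are placeholders, not our actual config) would look something like this in server.properties:

```
# Principals listed in super.users bypass topic ACLs entirely
authorizer.class.name=kafka.security.auth.SimpleAclAuthorizer
super.users=User:CN=kafka-broker-cert
```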

Could you share some more output of the debug logs you get from metricbeat? Do you see any other errors? Every 30s you should see the number of acked and not-acked events. Any details there?

As for the ACLs: unfortunately I don't know that part.

Sure, here are some of the latest logs from node4:


```
2017-05-03T16:04:18Z DBG  known consumer groups: %!(EXTRA []string=[BrokerTestProchain2 BrokerTestProchain11 BrokerTestProchain7 BrokerTestProchain16])
2017-05-03T16:04:18Z WARN Closed connection to broker qetest-kafka-green-4.taulia.com:9093

2017-05-03T16:04:28Z INFO Non-zero metrics in the last 30s: fetches.kafka-consumergroup.success=3 fetches.kafka-partition.events=50 fetches.kafka-partition.success=1 fetches.system-cpu.events=3 fetches.system-cpu.success=3 fetches.system-filesystem.events=17 fetches.system-filesystem.success=2 fetches.system-fsstat.events=3 fetches.system-fsstat.success=3 fetches.system-load.events=3 fetches.system-load.success=3 fetches.system-memory.events=3 fetches.system-memory.success=3 fetches.system-network.events=5 fetches.system-network.success=3 fetches.system-process.events=22 libbeat.es.call_count.PublishEvents=3 libbeat.es.publish.read_bytes=1976 libbeat.es.publish.write_bytes=92668 libbeat.es.published_and_acked_events=101 libbeat.publisher.messages_in_worker_queues=101 libbeat.publisher.published_events=101
2017-05-03T16:04:28Z WARN Connected to broker at qetest-kafka-green-4.taulia.com:9093 (unregistered)

2017-05-03T16:04:28Z DBG  known consumer groups: %!(EXTRA []string=[BrokerTestProchain2 BrokerTestProchain11 BrokerTestProchain7 BrokerTestProchain16])
2017-05-03T16:04:28Z WARN Closed connection to broker qetest-kafka-green-4.taulia.com:9093

2017-05-03T16:04:28Z DBG  PublishEvents: 50 events have been  published to elasticsearch in 10.901275813s.
2017-05-03T16:04:28Z DBG  send completed
2017-05-03T16:04:28Z DBG  output worker: publish 1 events
2017-05-03T16:04:28Z DBG  Publish: {
  "@timestamp": "2017-05-03T16:04:02.808Z",
  "beat": {
    "hostname": "qetest-kafka-green-4.taulia.com",
    "name": "qetest-kafka-green-4.taulia.com",
    "version": "5.3.2"
  },.....
.....
....
2017-05-03T16:04:28Z WARN Connected to broker at qetest-kafka-green-4.taulia.com:9093 (unregistered)

2017-05-03T16:04:28Z DBG  fetch events for topic: %!(EXTRA string=com.taulia.test.config)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=2)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=1)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=3)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=1)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=5)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=3)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=2)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=5)
2017-05-03T16:04:28Z DBG  fetch events for topic: %!(EXTRA string=__consumer_offsets)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=3)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=1)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=2)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=3)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=2)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=5)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=1)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=1)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=3)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=2)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=5)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=2)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=5)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=1)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=5)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=3)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=2)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=1)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=1)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=2)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=3)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=5)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=5)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=2)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=1)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=3)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=5)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=1)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=2)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=3)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=1)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=3)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=2)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=3)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=5)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=5)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=2)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=3)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=1)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=5)
2017-05-03T16:04:28Z DBG  fetch events for topic: %!(EXTRA string=com.taulia.broker.test.1.rahul)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=3)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=1)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=5)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=3)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=2)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=5)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=2)
2017-05-03T16:04:28Z DBG  broker is not leader (broker=4, leader=1)
2017-05-03T16:04:28Z WARN Closed connection to broker qetest-kafka-green-4.taulia.com:9093
```

The command cat metricbeat | grep "fetches.kafka-consumergroup.events" did not return anything.

I have exactly the same issue with a 2-node cluster.
Is this a problem with multi-node clusters?
The __consumer_offsets topic's leader partitions are balanced across the nodes.