Three types of WARN/ERRORs in filebeat logs

Hi, Filebeat experts,
We encountered three types of errors, copied below. Could anyone shed light on the scenario in which each one happens? Is the warning or error serious? For example, is the error transient, and does Filebeat still work fine with the error?

(1) 2017-10-23T22:59:42Z WARN kafka message: client/metadata got error from broker while fetching metadata:%!(EXTRA *net.OpError=read tcp 10.186.224.151:48230->10.186.34.103:10251: i/o timeout)

(2) 2017-10-23T23:49:12Z WARN kafka message: client/metadata got error from broker while fetching metadata:%!(EXTRA *errors.errorString=EOF)

(3) 2017-10-23T22:49:12Z ERR Not loading modules. Module directory not found: /..../bin/module

It would be nice to have more context for these log lines, especially for the Kafka logs. Could you please share your whole Filebeat output?

The third message suggests that Filebeat modules are not loaded, so you wouldn't be able to use them. It can happen if you misconfigured the path.home setting for Filebeat. See more about path.home here: https://www.elastic.co/guide/en/beats/filebeat/current/configuration-path.html#_home


1/2: The Kafka ones can be serious. The Kafka cluster metadata is required to find brokers, topics, and the partitions' current leaders. Without metadata, no healthy connection can be made.

3: Depends. If you are not using modules and Filebeat is running properly, you can ignore the error. Setting path.home should help.

@steffens and @kvch thanks for the reply! Re. Q3 above, which modules are relevant?
We run Filebeat as a standalone binary, installed in a location we specified, not in the default system directory.
I guess I need to copy the modules to the "bin/module" directory. However, there are two module directories under filebeat: filebeat/module and filebeat/scripts/module.
Do I need to copy both to the bin/module directory?
Or do I need to copy filebeat/module to bin/module and filebeat/scripts/module to bin/scripts/module?
I need to keep the original directory hierarchy under filebeat, right?
Hmm, after I copied over filebeat/module and filebeat/scripts/module, the error "ERR Not loading modules. Module directory not found: /..../bin/module" went away, but I encountered new errors:

2017-10-25T18:26:18-07:00 ERR Failed to create tempfile (/../data/registry.new) for writing: open /.../data/registry.new: no such file or directory
2017-10-25T18:26:18-07:00 ERR Writing of registry returned error: open /.../data/registry.new: no such file or directory. Continuing...
2017-10-25T18:26:28-07:00 ERR Failed to create tempfile (/..../data/registry.new) for writing: open /export/content/lid/apps/in-beats/dev-i001/data/registry.new: no such file or directory
2017-10-25T18:26:28-07:00 ERR Writing of registry returned error: open /..../data/registry.new: no such file or directory. Continuing...

There is a data/registry, but no data/registry.new?

Any tips on how to fix this? Thanks!

@kvch @steffens, re. the Kafka errors:
Here are more log lines. Sorry, it is too much work to mask our prod server names if I copy the entire Filebeat log, so I copied a couple of occurrences with context.
(1)
2017-10-24T15:48:06-07:00 INFO No non-zero metrics in the last 30s
2017-10-24T15:48:07-07:00 WARN client/metadata fetching metadata for all topics from broker kafka-logging-vip.stg:10251

2017-10-24T15:48:36-07:00 INFO Non-zero metrics in the last 30s: libbeat.output.kafka.bytes_write=23
2017-10-24T15:48:37-07:00 WARN kafka message: client/metadata got error from broker while fetching metadata:%!(EXTRA *net.OpError=read tcp 172.21.228.28:44493->10.251.140.37:10251: i/o timeout)
2017-10-24T15:48:37-07:00 WARN Closed connection to broker kafka-logging-vip.stg:10251
2017-10-24T15:48:37-07:00 WARN client/metadata fetching metadata for all topics from broker .stg.:25209
2017-10-24T15:48:37-07:00 WARN Connected to broker at .stg.:25209 (registered as #265009)
2017-10-24T15:49:06-07:00 INFO Non-zero metrics in the last 30s: libbeat.output.kafka.bytes_read=237352 libbeat.output.kafka.bytes_write=23
2017-10-24T15:49:36-07:00 INFO No non-zero metrics in the last 30s

I private-messaged you more logs. I have two questions:
  1. Are those "fetch meta data" errors transient? Would it affect sending logs to Kafka?
  2. After a broker was detected as having problems and disconnected, why was it registered as a new broker soon afterwards?

TBH I have no idea what you are doing. As you seem to move things around and the working directory seems to differ from time to time, I would suggest using absolute paths. Also check the permissions on the data path. Filebeat first writes the new registry to registry.new and then replaces the registry file with registry.new, so if anything goes wrong during the write, we at least still have the old state.
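
To illustrate the write-then-replace pattern described above, here is a minimal Python sketch. It is not Filebeat's actual code; the function name write_registry, the JSON layout of the state, and the example paths are assumptions made for illustration only. It shows why a missing data directory surfaces as "no such file or directory" on registry.new rather than on registry itself.

```python
import json
import os
import tempfile


def write_registry(data_dir, state):
    """Write state to <data_dir>/registry via a temporary registry.new file.

    Hypothetical helper for illustration; not Filebeat's implementation.
    """
    registry = os.path.join(data_dir, "registry")
    temp = registry + ".new"
    # Step 1: write the complete new state to registry.new.
    # open() raises FileNotFoundError here if data_dir does not exist.
    with open(temp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Step 2: swap the new file into place. Until this call succeeds,
    # the old registry file is left untouched.
    os.replace(temp, registry)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        data_dir = os.path.join(root, "data")
        try:
            write_registry(data_dir, [{"source": "/var/log/app.log", "offset": 0}])
        except FileNotFoundError as err:
            print("registry write failed:", err)
        os.makedirs(data_dir)
        write_registry(data_dir, [{"source": "/var/log/app.log", "offset": 42}])
        print(sorted(os.listdir(data_dir)))
```

After a successful write only the registry file remains, which would explain seeing data/registry but no data/registry.new on disk.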

Seems like you might want to learn some Kafka internals/architecture first :slight_smile: There are (at least) 3 kinds of connections (not counting ZooKeeper): producer->broker, consumer->broker, and broker->broker. When a broker joins a cluster it registers with the other brokers, and all brokers update their metadata, so producers and consumers can query the metadata from any broker. This is used to bootstrap the connections: a client (producer or consumer) first gets the metadata from any broker and then creates the actual producer->broker connections. Meanwhile producers, brokers, and consumers update the metadata every now and then to keep up with leader elections.

The catch is that working broker->broker connections (complete cluster metadata) do not mean producers can connect to all brokers. You can configure Beats to ignore unreachable brokers and distribute events only between reachable brokers. But as each partition is served by exactly one leader broker, this means events are no longer evenly distributed between partitions (by default, Beats blocks if one partition cannot be served).

Are those "fetch meta data" errors transient?

Metadata is crucial for operating a Kafka cluster. It more or less represents the cluster state (broker->partition assignments). Without metadata a client cannot produce or consume events. The errors should be transient, provided the system can recover. Check your Kafka cluster.

would it affect sending logs to Kafka?

Yes. Not knowing where to send logs to means we can not send logs.

After a broker was detected as having problems and disconnected, why was it registered as a new broker soon afterwards?

Brokers register with other brokers. Metadata can contain brokers that clients cannot connect to (or invalid URLs). This is not a Beats problem, but a general Kafka problem. Check your Kafka logs (+ disk usage :wink: ).

Producers will retry fetching metadata all the time and continue pushing events once a stable connection can be made. It seems like you are facing stability issues in your Kafka cluster and/or network.

@steffens the registry error was a one-time error. I redeployed Filebeat three times today and could not reproduce it. Not sure what happened yesterday.
We set the binary and data to different sub-directories in the prod deployment. We did not move things around.

@steffens thanks for the explanations!:grinning:
(1) I understand that after a broker is detected as having problems and disconnected, it could be registered again. However, I am surprised that it could be registered as a new broker SO SOON, actually at exactly the same "ms". Please note that the log lines appear back-to-back in the Filebeat log. Alternatively, Filebeat could mark a broker node as bad and not retry it for the next, say, 1 min.
(2) From the logs below: after Filebeat disconnected from node app0309, it connected to broker app0265. Therefore, I was hoping that by connecting to a different broker, app0265, the error on node app0309 won't affect sending logs to Kafka, and so I do not need to worry about these Kafka warnings.
Please correct me if my understanding is wrong :)

2017-10-24T17:28:07-07:00 WARN client/metadata fetching metadata for all topics from broker app0309:25206
2017-10-24T17:28:07-07:00 WARN kafka message: client/metadata got error from broker while fetching metadata:%!(EXTRA *errors.errorString=EOF)
2017-10-24T17:28:07-07:00 WARN Closed connection to broker app0309:25206
2017-10-24T17:28:07-07:00 WARN client/brokers deregistered broker #309006 at app0309:25206
2017-10-24T17:28:07-07:00 WARN client/metadata fetching metadata for all topics from broker app0265:25206
2017-10-24T17:28:07-07:00 WARN Connected to broker at app0265:25206 (registered as #265006)
2017-10-24T17:28:07-07:00 WARN client/brokers registered new broker #309006 at app0309:25206

Let's say, it depends.

Each partition has an active leader. If the leader of a partition is not reachable, no data can be published to the affected partition. A broker can be the leader of multiple partitions, but a partition can have only one leader. Connecting to a different broker gives you no guarantee that the partition's leader is reachable.

With Kafka it is the client that chooses which partitions data should be sent to. It's up to you to decide if and how to distribute data. It's also up to you to decide if the client should pause publishing if any partition (that is, the broker leading it) is not reachable.

The risk with not pausing is that all events might eventually end up in only one partition (worst-case scenario), potentially killing the Kafka cluster because one broker runs out of disk space. When sizing your cluster and configuring retention policies you have to take disk usage into account (the default retention policy is by time).

In the meantime the client libs used by beats try to reconnect to failed brokers + update meta-data every so often.

If you are constantly facing network/connection issues, you should have a look at your network infrastructure. Kafka provides load balancing, queuing, and some resiliency/HA, but if you have frequent or long-standing imbalances when publishing events, you are jeopardising overall system stability and throughput, e.g. through disk usage that is not balanced between the different brokers/nodes. Plus, each consumer in a consumer group exclusively reads from at least one partition (the number of partitions gives you the maximum parallelism within a consumer group: with N partitions, only up to N consumers in a consumer group can be served). Imbalances in data distribution caused by producers can lead to imbalances in parallelised downstream systems/processing.
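
As a rough illustration of the "N partitions serve at most N consumers" point, here is a small Python sketch. The function assign_partitions and the consumer names are made up for this example, and the simple modulo assignment is an assumption; Kafka's real group assignment strategies are more involved.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over consumers so each partition has exactly one owner.

    Toy assignment for illustration; not Kafka's actual assignor.
    """
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


if __name__ == "__main__":
    # 3 partitions, 4 consumers: one consumer ends up with nothing to read.
    print(assign_partitions([0, 1, 2], ["c1", "c2", "c3", "c4"]))
```

With 3 partitions and 4 consumers, the fourth consumer gets an empty list. In the same way, if producers push most events into one partition, the single consumer owning that partition carries most of the load.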

Check kafka output partitioning settings: https://www.elastic.co/guide/en/beats/filebeat/current/kafka-output.html#_partition

The default is:

output.kafka:
  partition.round_robin:
    group_events: 1
    reachable_only: false # set to true to not pause publishing if one partition becomes unavailable
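
To make the effect of reachable_only more concrete, here is a toy Python model of round-robin partition selection. It is only a sketch under assumptions: the class RoundRobinPartitioner and its pick method are hypothetical names, and the real partitioner in the client library used by Beats works differently in detail.

```python
from itertools import count


class RoundRobinPartitioner:
    """Toy model of round-robin partition selection (not the Beats implementation)."""

    def __init__(self, reachable_only=False):
        self.reachable_only = reachable_only
        self._counter = count()

    def pick(self, all_partitions, reachable):
        """Return the partition id for the next event, or None if none qualifies.

        all_partitions: list of partition ids of the topic.
        reachable: set of partition ids whose leader is currently reachable.
        """
        if self.reachable_only:
            candidates = [p for p in all_partitions if p in reachable]
        else:
            candidates = list(all_partitions)
        if not candidates:
            return None
        return candidates[next(self._counter) % len(candidates)]


if __name__ == "__main__":
    partitions, reachable = [0, 1, 2], {0, 2}  # leader of partition 1 is down
    strict = RoundRobinPartitioner(reachable_only=False)
    relaxed = RoundRobinPartitioner(reachable_only=True)
    print([strict.pick(partitions, reachable) for _ in range(4)])
    print([relaxed.pick(partitions, reachable) for _ in range(4)])
```

With reachable_only set to false, every third event is still assigned to partition 1, so publishing stalls until its leader comes back. With reachable_only set to true, events alternate between partitions 0 and 2, and partition 1 receives nothing in the meantime.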

Most of the functionality is provided by the kafka client libs used by beats. Client behavior is not special in beats, but common among all projects using kafka.

The deregister and register steps happen in the client library used by Beats, not in the Kafka cluster itself. Deregister means: remove the broker from the active broker set. Register means: add the broker to the active broker set. The logs indicate a 'try to reconnect'. The log sequence also includes a fetch-metadata call to app0265 at the same time.
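
A simplified Python model of that client-side broker set may help explain why the deregister and register lines appear back-to-back. This is a sketch, not the client library's actual code: the class BrokerSet and its methods are hypothetical, and it assumes the metadata returned by app0265 still lists broker #309006.

```python
class BrokerSet:
    """Toy model of the client's active broker set (not the real client library)."""

    def __init__(self, brokers):
        self.active = dict(brokers)  # broker id -> address
        self.log = []

    def deregister(self, broker_id):
        """Remove a broker from the active set, e.g. after a failed request."""
        addr = self.active.pop(broker_id)
        self.log.append(f"deregistered broker #{broker_id} at {addr}")

    def update_from_metadata(self, metadata):
        """Add every broker listed in a metadata response that is not yet active."""
        for broker_id, addr in metadata.items():
            if broker_id not in self.active:
                self.active[broker_id] = addr
                self.log.append(f"registered new broker #{broker_id} at {addr}")


if __name__ == "__main__":
    cluster = {309006: "app0309:25206", 265006: "app0265:25206"}
    brokers = BrokerSet(cluster)
    brokers.deregister(309006)             # metadata fetch from app0309 failed
    brokers.update_from_metadata(cluster)  # app0265 still reports app0309
    print("\n".join(brokers.log))
```

In this model the failed broker is dropped on error and added back by the very next metadata response, with no cool-down period in between.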
