Filebeat hangs forever until you force a shutdown with "kill -9"

When I start filebeat with the config file below, I can't stop it with Ctrl + C.

Config file: debug.yaml

# general
max_procs: 1
logging.metrics.enabled: false

# log
logging.to_files: true
logging.files:
  path: /data/home/logs

# queue
queue.mem:
  events: 8192
  flush.min_events: 4096
  flush.timeout: 5s

# input
filebeat.inputs:

- close_timeout: "5m"
  enabled: true
  encoding: "utf-8"
  exclude_files: 
  - "*"
  ignore_older: "24h"
  ignore_old_file_truncate: false
  paths:
  - "/data/home/test1.log"
  scan_frequency: "10s"
  tail_files: true
  type: "log"

output.console:
  pretty: true

Start command and result.

# ./filebeat7.17.2 -c debug.yaml 
^C^C

After pressing Ctrl + C, filebeat keeps running. A plain "kill" has no effect either, so I have to kill it with "kill -9" from another session.

# ps -ef | grep filebeat
root      1786 27940  0 16:10 pts/0    00:00:00 ./filebeat7.17.2 -c debug.yaml
root      4577 30030  0 16:20 pts/1    00:00:00 grep --color=auto filebeat
# kill 1786
# ps -ef | grep filebeat
root      1786 27940  0 16:10 pts/0    00:00:00 ./filebeat7.17.2 -c debug.yaml
root      4705 30030  0 16:21 pts/1    00:00:00 grep --color=auto filebeat
# kill -9 1786
# ps -ef | grep filebeat
root      4750 30030  0 16:21 pts/1    00:00:00 grep --color=auto filebeat

The filebeat log is as follows:

2022-09-14T16:10:36.674+0800    INFO    instance/beat.go:328    Setup Beat: filebeat; Version: 7.17.2
2022-09-14T16:10:36.674+0800    INFO    [publisher]     pipeline/module.go:113  Beat name: VM-77-52-centos
2022-09-14T16:10:36.675+0800    WARN    beater/filebeat.go:202  Filebeat is unable to load the ingest pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the ingest pipelines or are using Logstash pipelines, you can ignore this warning.
2022-09-14T16:10:36.675+0800    INFO    instance/beat.go:492    filebeat start running.
2022-09-14T16:10:36.675+0800    INFO    memlog/store.go:119     Loading data file of '/data/home/filebeat/data/registry/filebeat' succeeded. Active transaction id=0
2022-09-14T16:10:36.780+0800    INFO    memlog/store.go:124     Finished loading transaction log file for '/data/home/filebeat/data/registry/filebeat'. Active transaction id=8492
2022-09-14T16:10:36.780+0800    WARN    beater/filebeat.go:411  Filebeat is unable to load the ingest pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the ingest pipelines or are using Logstash pipelines, you can ignore this warning.
2022-09-14T16:10:36.780+0800    INFO    [registrar]     registrar/registrar.go:109      States Loaded from registrar: 4
2022-09-14T16:10:36.780+0800    INFO    [crawler]       beater/crawler.go:71    Loading Inputs: 1
2022-09-14T16:10:36.781+0800    INFO    [crawler]       beater/crawler.go:117   starting input, keys present on the config: [filebeat.inputs.0.close_timeout filebeat.inputs.0.enabled filebeat.inputs.0.encoding filebeat.inputs.0.exclude_files.0 filebeat.inputs.0.ignore_old_file_truncate filebeat.inputs.0.ignore_older filebeat.inputs.0.paths.0 filebeat.inputs.0.scan_frequency filebeat.inputs.0.tail_files filebeat.inputs.0.type]
2022-09-14T16:10:36.781+0800    WARN    [cfgwarn]       log/input.go:89 DEPRECATED: Log input. Use Filestream input instead.
2022-09-14T16:10:36.781+0800    INFO    beater/crawler.go:155   Stopping Crawler
2022-09-14T16:10:36.781+0800    INFO    beater/crawler.go:165   Stopping 0 inputs
2022-09-14T16:10:36.781+0800    INFO    beater/crawler.go:185   Crawler stopped
2022-09-14T16:10:36.781+0800    INFO    [registrar]     registrar/registrar.go:132      Stopping Registrar
2022-09-14T16:10:36.782+0800    INFO    [registrar]     registrar/registrar.go:166      Ending Registrar
2022-09-14T16:10:36.782+0800    INFO    [registrar]     registrar/registrar.go:137      Registrar stopped

When I start filebeat with the config file below, filebeat stops with an error.

Config file: debug2.yaml

# general
max_procs: 1
logging.metrics.enabled: false

# log
logging.to_files: true
logging.files:
  path: /data/home/logs

# queue
queue.mem:
  events: 8192
  flush.min_events: 4096
  flush.timeout: 5s

# input
filebeat.inputs:

- close_timeout: "5m"
  enabled: true
  encoding: "utf-8"
  exclude_files: []
  ignore_older: "24h"
  ignore_old_file_truncate: false
  paths:
  - "/data/home/test.log"
  scan_frequency: "10s"
  tail_files: true
  type: "log"

- close_timeout: "5m"
  enabled: true
  encoding: "utf-8"
  exclude_files: 
  - "*"
  ignore_older: "24h"
  ignore_old_file_truncate: false
  paths:
  - "/data/home/test1.log"
  scan_frequency: "10s"
  tail_files: true
  type: "log"

output.console:
  pretty: true

Start command and result.

# ./filebeat7.17.2 -c debug2.yaml 
Exiting: Failed to start crawler: starting input failed: error while initializing input: error parsing regexp: missing argument to repetition operator: `*` accessing 'filebeat.inputs.1.exclude_files.0' (source:'debug2.yaml')

The filebeat log is shown as follows:

2022-09-14T16:28:55.263+0800    INFO    instance/beat.go:328    Setup Beat: filebeat; Version: 7.17.2
2022-09-14T16:28:55.263+0800    INFO    [publisher]     pipeline/module.go:113  Beat name: VM-77-52-centos
2022-09-14T16:28:55.263+0800    WARN    beater/filebeat.go:202  Filebeat is unable to load the ingest pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the ingest pipelines or are using Logstash pipelines, you can ignore this warning.
2022-09-14T16:28:55.264+0800    INFO    instance/beat.go:492    filebeat start running.
2022-09-14T16:28:55.264+0800    INFO    memlog/store.go:119     Loading data file of '/data/home/filebeat/data/registry/filebeat' succeeded. Active transaction id=0
2022-09-14T16:28:55.370+0800    INFO    memlog/store.go:124     Finished loading transaction log file for '/data/home/filebeat/data/registry/filebeat'. Active transaction id=8496
2022-09-14T16:28:55.370+0800    WARN    beater/filebeat.go:411  Filebeat is unable to load the ingest pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the ingest pipelines or are using Logstash pipelines, you can ignore this warning.
2022-09-14T16:28:55.370+0800    INFO    [registrar]     registrar/registrar.go:109      States Loaded from registrar: 4
2022-09-14T16:28:55.370+0800    INFO    [crawler]       beater/crawler.go:71    Loading Inputs: 2
2022-09-14T16:28:55.371+0800    INFO    [crawler]       beater/crawler.go:117   starting input, keys present on the config: [filebeat.inputs.0.close_timeout filebeat.inputs.0.enabled filebeat.inputs.0.encoding filebeat.inputs.0.ignore_old_file_truncate filebeat.inputs.0.ignore_older filebeat.inputs.0.paths.0 filebeat.inputs.0.scan_frequency filebeat.inputs.0.tail_files filebeat.inputs.0.type]
2022-09-14T16:28:55.371+0800    WARN    [cfgwarn]       log/input.go:89 DEPRECATED: Log input. Use Filestream input instead.
2022-09-14T16:28:55.371+0800    INFO    [input] log/input.go:171        Configured paths: [/data/home/test.log] {"input_id": "7d57477b-15e0-4967-941e-daaafec3ff2f"}
2022-09-14T16:28:55.371+0800    INFO    [crawler]       beater/crawler.go:148   Starting input (ID: 4560358641352877852)
2022-09-14T16:28:55.371+0800    INFO    [crawler]       beater/crawler.go:117   starting input, keys present on the config: [filebeat.inputs.1.close_timeout filebeat.inputs.1.enabled filebeat.inputs.1.encoding filebeat.inputs.1.exclude_files.0 filebeat.inputs.1.ignore_old_file_truncate filebeat.inputs.1.ignore_older filebeat.inputs.1.paths.0 filebeat.inputs.1.scan_frequency filebeat.inputs.1.tail_files filebeat.inputs.1.type]
2022-09-14T16:28:55.371+0800    INFO    beater/crawler.go:155   Stopping Crawler
2022-09-14T16:28:55.371+0800    INFO    beater/crawler.go:165   Stopping 1 inputs
2022-09-14T16:28:55.373+0800    INFO    [crawler]       beater/crawler.go:170   Stopping input: 4560358641352877852
2022-09-14T16:28:55.373+0800    INFO    input/input.go:136      input ticker stopped
2022-09-14T16:28:55.373+0800    INFO    beater/crawler.go:185   Crawler stopped
2022-09-14T16:28:55.373+0800    INFO    [registrar]     registrar/registrar.go:132      Stopping Registrar
2022-09-14T16:28:55.373+0800    INFO    [registrar]     registrar/registrar.go:166      Ending Registrar
2022-09-14T16:28:55.373+0800    INFO    [registrar]     registrar/registrar.go:137      Registrar stopped
2022-09-14T16:28:55.376+0800    INFO    instance/beat.go:497    filebeat stopped.

After some digging, it seems to hang at https://github.com/elastic/beats/blob/v7.17.2/filebeat/beater/filebeat.go#L376, where it waits forever for a reference to the registry to be released.

A detailed analysis follows.

When the first input in the config file has exclude_files configured with an invalid regular expression such as "*", the error is returned at filebeat.go:468 and the crawler is stopped after that.

After the crawler stops, look at filebeat.go:422: the task group is stopped, which means no more goroutines can be started.

The task group is used along the call chain filebeat.go:423 -> loader.go:68 -> manager.go:135.

As mentioned above, the inability to release the reference to the registry causes the filebeat process to hang. The registry reference is released in cim.shutdown() (manager.go:136). If the Go method (manager.go:135) successfully starts the task, cim.shutdown() is eventually executed and the filebeat process can exit with an error.
However, since the process hangs, something must go wrong when the task group's Go method is called, so the task never executes.
So let's look at the details of the Go method (taskgroup.go:152):

func (t *TaskGroup) Go(fn func(context.Context) error) error {
	t.init(context.Background())

	if err := t.wg.Add(1); err != nil {
		return err
	}

	go func() {
		defer t.wg.Done()

		for t.closer.Err() == nil {
			err := fn(t.closer)
			action, err := t.OnQuit(err)

			if err != nil && err != context.Canceled {
				t.mu.Lock()
				t.errs = append(t.errs, err)
				if t.MaxErrors > 0 && len(t.errs) > t.MaxErrors {
					t.errs = t.errs[1:]
				}
				t.mu.Unlock()
			}

			switch action {
			case TaskGroupStopActionContinue:
				return // finish managed go-routine, but keep other routines alive.
			case TaskGroupStopActionShutdown:
				t.signalStop()
				return
			case TaskGroupStopActionRestart:
				// continue with loop
			}
		}
	}()

	return nil
}

The task "fn" is only executed while "t.closer.Err() == nil". But after the task group has stopped, t.closer.Err() returns "context canceled", so the task is never executed. As a result, cim.shutdown() (manager.go:136) is never called to release the registry reference.

In short, when the first input's error causes the crawler to stop, the task group is stopped before the goroutine that would eventually release the registry reference gets a chance to run.

Hence, the filebeat process can only be stopped by force with "kill -9".

The referenced source files:

filebeat.go: beats/filebeat.go at v7.17.2 (elastic/beats on GitHub)

loader.go: beats/loader.go at v7.17.2 (elastic/beats on GitHub)

manager.go: beats/manager.go at v7.17.2 (elastic/beats on GitHub)

taskgroup.go: go-concert/taskgroup.go at v0.2.0 (elastic/go-concert on GitHub)

Does this still happen if you use a newer version of Filebeat?

Yes, it still happens in version 8.4.1. In fact, after finding this exception I tested the latest version to see whether it had already been fixed. Before answering your question I reconfirmed it. At first I couldn't reproduce the exception, because other tests I had run with filebeat a few days earlier had made the registry file very large. After realizing the difference between the two test setups, I deleted the registry file to get the same environment as in the original test. I then started both the old version 7.17.2 and the new version 8.4.1 again with the broken configuration, and both versions hung in the same way.
