How to configure Filebeat to ingest logs from nested workflow directories created after Filebeat starts

Similar to the question here, we need to ingest log files from a nested directory structure. However, unlike that question and many other use cases, we are not ingesting logs from a static, persistent service or app; we are ingesting logs from workflows that we will be running over time.

Importantly, this means that the exact log files and directories that Filebeat needs to ingest from will not exist when Filebeat starts running and cannot be known ahead of time. Another caveat is that the directory structure will be huge (thousands of files and subdirectories), so "crawling" the entire directory tree is not ideal.

The main directory structure looks like this:

/work/<timestamp>/<UUID>/logs/run.log

Examples:

/work/1621363120/de30d8af-4d2f-4dae-9ede-8a9dee31377e/logs/run.log
/work/1621363267/b6f9896b-1f20-4c42-aaf8-36bf9161f6c6/logs/run.log
/work/1621363290/327af1fc-4445-4e88-a088-42029d27ce86/logs/run.log

The "traditional" way to run Filebeat seems to be to start a single instance of Filebeat, pointing it to a single directory, or a list of directories, that you know ahead of time, such as;

filebeat -e -c config/filebeat.yml -E "filebeat.inputs=[{type:log,paths:['/work/logs/*']}]"

However, this only works under the assumption that all logs are going to show up in the simple location(s) described in the config file or CLI args.
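
For reference, the same static setup expressed directly in filebeat.yml would look roughly like this (a minimal sketch; the path is the same placeholder as in the command above):

filebeat.inputs:
  - type: log
    paths:
      - /work/logs/*   # single, known-ahead-of-time location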

It's not clear to me how to run Filebeat in such a way that it will detect new logs produced over time in a nested directory structure like the one described here. Any suggestions?

One idea I had been testing out was to not run a single, persistent Filebeat service, but instead bundle the running of Filebeat directly with the running of our workflow. For example:

# run ID is pre-generated and supplied 
run_id="${1}"

# set locations to run the workflow
timestamp="$(date +%s)"
run_dir="/work/${timestamp}/${run_id}"
log_dir="${run_dir}/logs"
mkdir -p "${log_dir}"
workflow_log_file="${log_dir}/run.log"
filebeat_data="${run_dir}/filebeat_data"
filebeat_logs="${run_dir}/filebeat_log"
filebeat_pid="${run_dir}/filebeat.pid"

# start Filebeat, push it to the background
filebeat \
-c "/path/to/config/filebeat.yml"\
-E "filebeat.inputs=[{type:log,paths:['${workflow_log_file}']}]" \
-d "publish" \
--path.data "${filebeat_data}" \
--path.logs "${filebeat_logs}" & 

# record Filebeat process id
fb_pid="$!"
echo "${fb_pid}" > "${filebeat_pid}"

# kill Filebeat when everything finishes
trap "cat ${filebeat_pid} | xargs kill " EXIT TERM INT

# start running the workflow
toil-cwl-runner \
--writeLogs "${log_dir}" \
--logFile "${workflow_log_file}" \
--workDir "${run_dir}/work" \
--tmpdir-prefix "${run_dir}/tmp" \
--output-dir "${run_dir}/output" \
/path/to/workflow.cwl

# bash trap kills Filebeat when workflow is completed

However, upon first testing this I ran into the situation where a single Filebeat data directory cannot be shared by multiple running Filebeat instances. I was able to get around that by giving each workflow its own Filebeat data directory inside the workflow's run directory, though this does not seem like the way you are "supposed" to do this.

I am also not sure whether this would scale well. We intend to have many workflows spawned and running in parallel over time, and it appears that each Filebeat -> Logstash connection needs its own pre-mapped network address and port. So it seems like running things this way would still run into port collisions on the host system when more than one Filebeat instance is running and trying to communicate with Logstash.
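
For context, the Logstash output section of our filebeat.yml is just a single fixed endpoint along these lines (the host and port are placeholders, not our real values), and every Filebeat instance started by the script above would point at it:

output.logstash:
  hosts: ["logstash.example.com:5044"]   # placeholder host:port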

Is there a better way to do this? What is the general approach you are supposed to take when you want to ingest new log files created over time in different locations?

Really it feels like Filebeat is the wrong tool for this situation, since it would be much easier to have a persistent service that I could communicate with and explicitly tell about each new log file to ingest, instead of having the service magically figure it out itself. After all, when I start my workflow, I know where the log file will be, but it seems like I have no way of actually communicating that to Filebeat or Logstash after they have started.

After messing with it for a while, I found that Filebeat actually works fine at detecting new log files in newly created subdirectories under its configured search path, and that starting and stopping a new Filebeat instance for every workflow was not a good idea. So normal Filebeat functionality worked fine here.
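
In case it helps anyone else, here is a minimal sketch of the kind of input configuration that ended up working for the layout described above (the glob pattern and scan_frequency value are examples, not necessarily the exact settings used):

filebeat.inputs:
  - type: log
    paths:
      - /work/*/*/logs/run.log   # matches /work/<timestamp>/<UUID>/logs/run.log
    scan_frequency: 10s          # how often to look for new files matching the globs (example value)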
