Backfilling with filebeat "projectwide" -- how best to enrich?

(Cprior) #1

Hi there!

Occasionally I get a bunch of logfiles from several machines, dating back a couple of weeks.
These are similar machines in terms of hardware and software,
and differ in a few config items like hostname/uuid and such.
I tend to delete these indices after my analysis. I never "stream" them into the index in the first place, but do a plain load with a prospector pointing to a path like C:\Temp\Loganalysis\foobar*\app-201*.log

I am successfully importing these with filebeat through logstash, and even manage to calculate elapsed time with -w 1 and "elapsed" -- and visualize them in Kibana. All very nice!

Now I want to go the next step and automate my "loading" even further.

My most important config items, hostname and uuid, are never in the first line(s) of any logfile, but part of the zip-Archive that I get delivered, like

  • db
  • db/export.csv
  • log
  • log/app.YYYY-MM-DD.log
  • uuid

I typically get 50-150 machines with 30-60 days of archives.

I extract the logfiles from the zip files with a bash script in cygwin, and also extract the uuid from an accompanying file -- a huge for loop, and I write everything into an intermediate folder structure. I love that part because separating E-T-L feels like the right thing to do.
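A minimal sketch of that staging step, assuming the archive has already been unzipped and follows the layout listed above (a `uuid` file next to a `log` folder). The function name `stage_machine` and the destination layout `<dest_root>/<uuid>/` are hypothetical, chosen only for illustration:

```shell
#!/usr/bin/env bash
# Sketch: copy the logs of one unzipped machine archive into <dest_root>/<uuid>/.
# Assumes <src_dir>/uuid holds the machine uuid and <src_dir>/log/ holds the logs.
stage_machine() {
  local src_dir=$1 dest_root=$2 uuid
  # Strip all whitespace, including the \r of Windows line endings.
  uuid=$(tr -d '[:space:]' < "${src_dir}/uuid") || return 1
  [ -n "${uuid}" ] || { echo "no uuid found in ${src_dir}" >&2; return 1; }
  mkdir -p "${dest_root}/${uuid}"
  cp "${src_dir}"/log/*.log "${dest_root}/${uuid}/"
  # Print the uuid so the caller can reuse it, e.g. for a fields.uuid setting.
  echo "${uuid}"
}
```

Called once per archive inside the existing for loop, this keeps the "E" phase separate and leaves one folder per machine for the later load.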

Now what should I do with ${UUID} and C:\Temp\Loganalysis\foobar\${UUID}\log-201*.log containing about 5% of interesting include_lines?

Ideally I would

  • not just be able to enrich each "machine import" with its uuid,
  • but also be able to enrich each single file
    during "prospecting", because I try to save the md5sum of each file with each line (I am paranoid about importing a file twice: stupid rotation-by-filesize! I also have server.log and server.log.1 in log4j fashion that I plan to import in the future, once this app.YY-MM-DD.log is working).
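The double-import worry can also be handled before Filebeat ever sees a file. The following is a minimal sketch, assuming a plain-text ledger with one checksum per line; the `LEDGER` file and the function names are hypothetical:

```shell
#!/usr/bin/env bash
# Sketch: skip files whose content checksum was already imported.
# LEDGER is a hypothetical text file holding one md5 checksum per line.
LEDGER=${LEDGER:-imported.md5}

file_md5() {
  # md5sum prints "<hash>  <name>"; keep only the 32-character hash.
  md5sum "$1" | cut -c -32
}

import_once() {
  local file=$1 sum
  sum=$(file_md5 "${file}")
  [ -n "${sum}" ] || return 1
  # -x matches whole lines, -F treats the checksum as a fixed string.
  if [ -f "${LEDGER}" ] && grep -qxF "${sum}" "${LEDGER}"; then
    echo "skip ${file}"
    return 0
  fi
  echo "${sum}" >> "${LEDGER}"
  echo "import ${file} ${sum}"
}
```

Because the checksum is taken over the content, a rotated copy such as server.log.1 with identical bytes is reported as `skip` even though its name differs.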

Should I write a filebeat.yml for each import and start (sequentially) a filebeat.exe for each machine with its uuid?

The logfile format itself does not lend itself to appending to the end of each line, as there is no separator other than whitespace. But if you experienced guys tell me to enrich each logline first then I'd also script that -- and give the E-T-L its own phase.

I also can do filebeat.exe -e -E fields.uuid='${UUID}' -c filebeat-applog.yml but that needs one filebeat*.yml per machine, because there is no command line option for the YAML-prospector-config with the - paths: - C:\Temp\Loganalysis\foobar\*\app-201*.log structure.
And it does not work with a field "md5sum" per file.

I know that backfilling is kind of against the grain, but is actually common outside the cosy world of datacenters! :wink:
So please don't scold me for such a usecase. :wink:

I would be very happy if some of you shared your "enriching procedures" with me.
If there was a way to "T-L" in one step I'd gladly take this shortcut.


(ruflin) #2

I'm throwing some thoughts in here:

  • Why do you want to define one full filebeat.yml for each file? Why not have one prospector per file? If you have one prospector per file, you can add all the required fields above as fields into each prospector.
  • You can use config_dir to create lots of small config files if you have to
  • The yaml config files should be generated by your script

Could this help?
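A minimal sketch of the "many small config files" idea, assuming the extracted logs sit under `<extract_root>/<uuid>/` and that each file in the config directory repeats the `filebeat.prospectors:` key (this is how the Filebeat 5.x config_dir option is understood here; please check the documentation for your version). The function name and the file naming scheme are hypothetical:

```shell
#!/usr/bin/env bash
# Sketch: write one small prospector config per log file into a config directory.
# Assumed layout: <extract_root>/<uuid>/<logfile>.log
write_prospector_configs() {
  local extract_root=$1 conf_dir=$2 file uuid sum n=0
  mkdir -p "${conf_dir}"
  for file in "${extract_root}"/*/*.log; do
    [ -f "${file}" ] || continue
    # The parent directory name is taken as the machine uuid.
    uuid=$(basename "$(dirname "${file}")")
    sum=$(md5sum "${file}" | cut -c -32)
    n=$((n + 1))
    cat > "${conf_dir}/prospector-${n}.yml" <<EOF
filebeat.prospectors:
- input_type: log
  paths:
    - ${file}
  fields:
    uuid: ${uuid}
    md5sum: ${sum}
EOF
  done
  # Report how many config files were written.
  echo "${n}"
}
```

The main filebeat.yml would then only need the output section plus a pointer to the generated directory.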

(Steffen Siering) #3

Another idea: if the path contains all the metadata you need, e.g. <name>_<value>_<name>_<value>..., you can try to parse the metadata via logstash from the source field.
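To illustrate the naming convention only (not the Logstash side), here is a small sketch that encodes metadata in a directory name and recovers it from a full path. The `uuid_<value>_md5_<value>` layout and both function names are assumptions for illustration; in Logstash the same pattern would be applied to the source field with a filter of your choice:

```shell
#!/usr/bin/env bash
# Sketch: encode metadata in a directory name and parse it back from a path.
# The "uuid_<value>_md5_<value>" layout is an assumed convention.
encode_dir() {
  local uuid=$1 md5=$2
  echo "uuid_${uuid}_md5_${md5}"
}

parse_source() {
  local parsed
  # Values must not contain "_" or "/", which holds for uuids and hex checksums.
  parsed=$(printf '%s\n' "$1" |
    sed -n -E 's|.*uuid_([^_/]+)_md5_([^_/]+)/.*|uuid=\1 md5sum=\2|p')
  [ -n "${parsed}" ] || return 1
  echo "${parsed}"
}
```

The trade-off is that the path becomes the single carrier of metadata, so no per-file prospector fields are needed, but every value has to be safe to use in a directory name.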

(Cprior) #4

These are interesting suggestions, many thanks: I did not think of any of them and both are easy to implement.

(I know enough YAML to understand that - starts a list item, but generating several of them never occurred to me.)

I will test-drive the suggestions (and be careful with elapsed in my logstash instance, because I only got elapsed working with -w 1 -- and my "multi-import" might need conditionals to break the calculations.)

Will report back.

(Cprior) #5

A quick mockup script implements the first suggestion, "generate a filebeat.yml prospector per file", and still gives me the opportunity to parse the source field.

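#!/bin/bash
#quite a few bashisms

#base directory as it appears in the generated output
BASEDIREXTR=/cygdrive/c/Temp/Loganalysis/extracting
PROSPECTORS=""

for filetoindex in app.log.2017-01-01 app.log.2017-01-02 app.log.2017-01-03; do
  #mockup only: the original definition is missing here, uuidgen is an assumed stand-in
  _uuidmockup=$(uuidgen)
  _md5mockup=$(echo ${filetoindex} | md5sum | cut -c -32)
  #read returns non-zero at the end of the heredoc, which is harmless here
  read -d '' tmp <<EOF
- input_type: log
  paths:
    - ${BASEDIREXTR}/${_uuidmockup}/${filetoindex}
  fields:
    project: cpriorpoc
    loganversion: 5
    md5sum: ${_md5mockup}
  include_lines: ["LOGIN:", "startTransaction",  "endTransaction" ]
EOF
  #ugly but works: collect all prospectors, separated by blank lines
  PROSPECTORS="${PROSPECTORS}"$'\n'"${tmp}"$'\n'
done
#echo -e "${PROSPECTORS}"

#to escape a \ use \\\\
read -d '' String <<EOF
filebeat.prospectors:
${PROSPECTORS}
fields:
  env: testingcpr

output.logstash:
  hosts: ["localhost:5043"]
EOF

#needs the quotes
echo -e "${String}"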

The result looks promising:

filebeat.prospectors:

- input_type: log
  paths:
    - /cygdrive/c/Temp/Loganalysis/extracting/fffa6824-d7cb-431c-a53d-dcd391d39110/app.log.2017-01-01
  fields:
    project: cpriorpoc
    loganversion: 5
    md5sum: e0d3c43c71a8d270af86099d8741912b
  include_lines: ["LOGIN:", "startTransaction",  "endTransaction" ]

- input_type: log
  paths:
    - /cygdrive/c/Temp/Loganalysis/extracting/5f6f4ccc-da83-4398-8e7d-0fd7509fdb29/app.log.2017-01-02
  fields:
    project: cpriorpoc
    loganversion: 5
    md5sum: 06b7a471eec08f4c5e4820806f2bc840
  include_lines: ["LOGIN:", "startTransaction",  "endTransaction" ]

- input_type: log
  paths:
    - /cygdrive/c/Temp/Loganalysis/extracting/13f5ab89-2c09-4bfd-8a37-0ab1d44853e7/app.log.2017-01-03
  fields:
    project: cpriorpoc
    loganversion: 5
    md5sum: d42b7c01c87b114e3ccc50b1267af452
  include_lines: ["LOGIN:", "startTransaction",  "endTransaction" ]

fields:
  env: testingcpr

output.logstash:
  hosts: ["localhost:5043"]

(ruflin) #6

Good to hear it is working and thanks for sharing with others.

(Cprior) #7

I also found in the Issues the -once switch which seems to be available in the current download.
This is indeed a missing link for me to "loop in a script through backfilling jobs".
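A minimal sketch of such a loop, assuming one generated config file per backfilling job in a single directory. `FILEBEAT_BIN` and the function name are placeholders, and the `-once` flag is spelled as in this post; its exact spelling may differ between Filebeat versions:

```shell
#!/usr/bin/env bash
# Sketch: run one Filebeat process per generated config, one after another.
# FILEBEAT_BIN is a placeholder; point it at filebeat.exe on Windows/cygwin.
FILEBEAT_BIN=${FILEBEAT_BIN:-filebeat}

run_backfill_jobs() {
  local jobs_dir=$1 cfg failed=0
  for cfg in "${jobs_dir}"/*.yml; do
    [ -f "${cfg}" ] || continue
    echo "backfilling with ${cfg}"
    # With -once, Filebeat is expected to exit after reading the files,
    # so the loop can move on to the next job.
    if ! "${FILEBEAT_BIN}" -e -once -c "${cfg}"; then
      echo "filebeat failed for ${cfg}" >&2
      failed=$((failed + 1))
    fi
  done
  # Succeed only if every job succeeded.
  [ "${failed}" -eq 0 ]
}
```

Running the jobs sequentially also keeps a single-worker Logstash pipeline (the -w 1 setup mentioned earlier) from seeing interleaved events of different machines.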

For all us "data analysts" this is a killer feature and it deserves every attention!
When I am done experimenting I will write up a blog post.

(ruflin) #8

@cprior Great to hear. Can you link the blog post here when you publish it so we also get notified about it?
