Backfilling with filebeat "projectwide" -- how best to enrich?

Hi there!

Occasionally I get a bunch of logfiles from several machines, dating back a couple of weeks.
These are similar machines in terms of hardware and software,
differing only in a few config items like hostname, uuid and such.
I tend to delete these indices after my analysis and never "stream" them into the index in the first place; instead I plain-load them with a prospector pointing at a path like C:\Temp\Loganalysis\foobar*\app-201*.log

I am successfully importing these with Filebeat through Logstash, and even manage to calculate elapsed times with -w 1 and the "elapsed" filter -- and visualize them in Kibana. All very nice!
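
For reference, my elapsed setup looks roughly like this (a minimal sketch: the tags and the transaction_id field are placeholders for whatever my grok stage extracts earlier; elapsed keeps its start events in memory, which is why it only works for me with a single worker, -w 1):

filter {
  #tag start and end lines (the match strings come from my include_lines)
  if "startTransaction" in [message] {
    mutate { add_tag => ["txn_start"] }
  }
  if "endTransaction" in [message] {
    mutate { add_tag => ["txn_end"] }
  }
  elapsed {
    start_tag       => "txn_start"
    end_tag         => "txn_end"
    #hypothetical correlation field extracted further up the pipeline
    unique_id_field => "transaction_id"
  }
}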

Now I want to go the next step and automate my "loading" even further.

My most important config items, hostname and uuid, are never in the first line(s) of any logfile, but are part of the zip archive that I get delivered, like

archive.YYYY-MM-DD.zip

  • db
  • db/export.csv
  • log
  • log/app.YYYY-MM-DD.log
  • uuid

I typically get 50-150 machines, with 30-60 days of archives each.

I extract the logfiles from the zip files with a bash script in Cygwin, and also extract the uuid from an accompanying file -- one huge for loop, writing everything into an intermediate folder structure. I love that part, because separating E-T-L feels like the right thing to do.
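
Stripped down, that extract step looks roughly like this (a sketch; the incoming folder name and the globs are illustrative, error handling omitted):

#!/bin/bash
#E step: unzip each archive, read the machine uuid from the uuid file,
#and file the logs into an intermediate per-uuid folder
BASEDIR=/cygdrive/c/Temp/Loganalysis

for archive in "${BASEDIR}"/incoming/archive.*.zip; do
  tmpdir=$(mktemp -d)
  unzip -q "${archive}" -d "${tmpdir}"
  uuid=$(tr -d '[:space:]' < "${tmpdir}/uuid")
  mkdir -p "${BASEDIR}/extracting/${uuid}"
  cp "${tmpdir}"/log/app.*.log "${BASEDIR}/extracting/${uuid}/"
  rm -rf "${tmpdir}"
done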

Now what should I do with ${UUID} and C:\Temp\Loganalysis\foobar\${UUID}\log-201*.log, where only about 5% of the lines match my interesting include_lines?

Ideally I would

  • not just be able to enrich each "machine import" with its uuid,
  • but also be able to enrich each single file during "prospecting", because I try to save the md5sum of each file with each line. (I am paranoid about importing a file twice -- stupid rotation-by-filesize! I also have server.log and server.log.1 in log4j fashion that I plan to import in the future, once this app.YYYY-MM-DD.log import is working.)

Should I write a filebeat.yml for each import and start (sequentially) a filebeat.exe for each machine with its uuid?

The logfile format itself does not lend itself to appending to the end of each line, as there is no separator other than whitespace. But if you experienced folks tell me to enrich each logline first, then I'd script that as well -- and give the E-T-L its own phase.

I can also run filebeat.exe -e -E fields.uuid='${UUID}' -c filebeat-applog.yml, but that needs one filebeat*.yml per machine, because there is no command-line option for the YAML prospector config with its "- paths: - C:\Temp\Loganalysis\foobar\*\app-201*.log" structure.
And it does not work with a per-file "md5sum" field either.

I know that backfilling is kind of against the grain, but it is actually common outside the cosy world of datacenters! :wink: https://medium.com/@cprior/a-blind-spot-in-textbook-service-management-1b464dc0aec9
So please don't scold me for such a use case. :wink:

I would be very happy if some of you shared your "enriching procedures" with me.
If there were a way to do the "T-L" in one step, I'd gladly take that shortcut.

BR,
Chris

I'm throwing some thoughts in here:

  • Why do you want to define one full filebeat.yml for each file? Why not have one prospector per file? With one prospector per file, you can add all the required fields above as fields in each prospector.
  • You can use config_dir to create lots of small config files if you have to (see the sketch after this list)
  • The yaml config files should be generated by your script
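
A minimal sketch of the config_dir route (paths and the <uuid> placeholder are illustrative; per the 5.x docs, each file in the directory carries its own filebeat.prospectors section):

#main filebeat.yml
filebeat.config_dir: C:\Temp\Loganalysis\conf.d

output.logstash:
  hosts: ["localhost:5043"]

plus one generated file per machine, e.g. conf.d\machine-<uuid>.yml:

#generated per machine by your script
filebeat.prospectors:
- input_type: log
  paths:
    - C:\Temp\Loganalysis\foobar\<uuid>\app-201*.log
  fields:
    uuid: <uuid>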

Could this help?


Another idea: if the path contained all the metadata you need, e.g. <name>_<value>_<name>_<value>..., you could parse the metadata out of the source field via Logstash.
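
A minimal sketch of that via grok (assuming the per-uuid folder layout from above; UUID is a stock grok pattern, so adjust if your ids are not RFC-4122 style):

filter {
  grok {
    #pull the machine id out of the path Filebeat ships in the source field
    match => { "source" => "%{UUID:machine_uuid}" }
  }
}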


These are interesting suggestions, many thanks: I did not think of either of them, and both are easy to implement.

(I know enough YAML to understand that "-" starts a list item, but generating several of these never occurred to me.)

I will test-drive the suggestions (and be careful with elapsed in my Logstash instance, because I only got elapsed working with -w 1 -- and my "multi-import" might need conditionals to keep the calculations apart.)
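
Instead of conditionals, one thing I might try (a sketch; txn_key is a hypothetical name, and transaction_id stands for whatever correlation field my pipeline already extracts) is folding the per-machine uuid into the correlation id, so elapsed never pairs a start and an end from different machines:

filter {
  #build a per-machine correlation key from the Filebeat-provided uuid field
  mutate {
    add_field => { "txn_key" => "%{[fields][uuid]}_%{transaction_id}" }
  }
  #then point elapsed at it: unique_id_field => "txn_key"
}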

Will report back.

A quick mockup script implements the first suggestion, "generate a filebeat.yml prospector per file", and still gives me the opportunity to parse the source field.

#!/bin/bash
#quite a few bashisms

BASEDIREXTR=/cygdrive/c/Temp/Loganalysis/extracting

PROSPECTORS=""

#mockup
for filetoindex in app.log.2017-01-01 app.log.2017-01-02 app.log.2017-01-03; do
tmp=""
_md5mockup=$(echo ${filetoindex} | md5sum | cut -c -32)
_uuidmockup=$(uuidgen)
#read the heredoc into a variable (read -d '' exits nonzero at EOF, which is harmless here)
read -d '' tmp <<EOF
- input_type: log
  paths:
    - ${BASEDIREXTR}/${_uuidmockup}/${filetoindex}
  fields:
    project: cpriorpoc
    loganversion: 5
    md5sum: ${_md5mockup}
  include_lines: ["LOGIN:", "startTransaction",  "endTransaction" ]
EOF

#ugly but works
PROSPECTORS+="${tmp}

"
done
#echo -e "${PROSPECTORS}"

#http://serverfault.com/a/72511
#to escape a \ use \\\\
read -d '' String <<EOF
filebeat.prospectors:
${PROSPECTORS}
fields:
  env: testingcpr

output.logstash:
  hosts: ["localhost:5043"]
EOF

#needs the quotes
echo -e "${String}"

The result is promising:

filebeat.prospectors:
- input_type: log
  paths:
    - /cygdrive/c/Temp/Loganalysis/extracting/fffa6824-d7cb-431c-a53d-dcd391d39110/app.log.2017-01-01
  fields:
    project: cpriorpoc
    loganversion: 5
    md5sum: e0d3c43c71a8d270af86099d8741912b
  include_lines: ["LOGIN:", "startTransaction",  "endTransaction" ]

- input_type: log
  paths:
    - /cygdrive/c/Temp/Loganalysis/extracting/5f6f4ccc-da83-4398-8e7d-0fd7509fdb29/app.log.2017-01-02
  fields:
    project: cpriorpoc
    loganversion: 5
    md5sum: 06b7a471eec08f4c5e4820806f2bc840
  include_lines: ["LOGIN:", "startTransaction",  "endTransaction" ]

- input_type: log
  paths:
    - /cygdrive/c/Temp/Loganalysis/extracting/13f5ab89-2c09-4bfd-8a37-0ab1d44853e7/app.log.2017-01-03
  fields:
    project: cpriorpoc
    loganversion: 5
    md5sum: d42b7c01c87b114e3ccc50b1267af452
  include_lines: ["LOGIN:", "startTransaction",  "endTransaction" ]


fields:
  env: testingcpr

output.logstash:
  hosts: ["localhost:5043"]

Good to hear it is working and thanks for sharing with others.

I also found in the issues the -once switch, which seems to be available in the current https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-5.1.2-windows-x86_64.zip download.
This is indeed the missing link for me to "loop through backfilling jobs in a script".
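
My backfill loop will then look roughly like this (a sketch; the config filenames come from my generator script, and cygpath is only needed because I drive filebeat.exe from Cygwin):

#one -once run per generated machine config
for cfg in /cygdrive/c/Temp/Loganalysis/conf/filebeat-*.yml; do
  ./filebeat.exe -e -once -c "$(cygpath -w "${cfg}")"
done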

For all of us "data analysts" this is a killer feature, and it deserves every bit of attention!
When I am done experimenting I will write up a blog post.

@cprior Great to hear. Can you link the blog post here when you publish it so we also get notified about it?
