Bizarre log structure for parsing in Logstash. Opinions appreciated

Hello everyone, I have a log file which basically doesn’t follow the format of common log files. Normally in the known log like Apache, you have an event logged in one line. Mine, however, spread out through multiline, kinda like COBOL logs. I already tried multiline codec to parse this but it ain’t helping much. I want to extract some information, I'll detail it at the end of the topic.

This topic is fairly long and complicated imo, so I really really appreciated your time and your opinion on this because I've never seen any log like this before. Feel free to quote me on anything if you don't get it. I'll be happy to buy you a coffee for sorting this out :slight_smile:

Here is an example of the structure of the log.

*****************************************************************
*            Chain ABC           (UNIX)                    *
***************************************************************** 

JOB: ABC                          DATE : 15/07/2019 12:15:42

*-------------------------- STEP000 ---------------------------*
*                  Initialization                 *
*---------------------------------------------------------------*

AA100PPP-STOC0001-------------------------------------
AA100PPP-STOC0002-PROGRAM IDENTIFICATION : PPP/2.000/2018-01-01/F/
AA100PPP-STOC0003-START PROCESSING – START TIMESTAMP : 2019-07-15-12.45.00
AA100PPP-STOC0004-....+....1....+....2....+....3....+....4....+....5
…
AA100PPP-STOC0010-*** PPP : END SUCCESSFULLY – END TIMESTAMP : 2019-07-15-12.46.00

*-------------------------- STEP001 ---------------------------*
*             Configuration              *
*---------------------------------------------------------------*
…
*-------------------------- STEP002 ---------------------------*
*             Execution              *
*---------------------------------------------------------------*
…
*-------------------------- STEP003 ---------------------------*
*             Computing              *
*---------------------------------------------------------------*
BBCPTXYZ-STOC0001-------------------------------------
BBCPTXYZ-STOC0002-PROGRAM IDENTIFICATION : TY1/2.000/2018-01-01/F/
*CD00TY1-STOC0002-PROGRAM IDENTIFICATION : XYZ/2.000/2018-02-14/F/
*CD00TY1-STOC0003-START PROCESSING – START TIMESTAMP : 2019-07-15-12.51.00
…
BBCPTXYZ-STOC0001-------------------------------------
*CD00TY1-STOC0002-PROGRAM IDENTIFICATION : OPM/2.000/2018-02-14/F/
*CD00TY1-STOC0003-START PROCESSING – START TIMESTAMP : 2019-07-15-12.53.00
BBCPTOPM-STOC0001-------------------------------------
BBCPTOPM-STOC0002-FILE EXTRACTED : 000000000002
*CD00TY1-STOC0010-*** XYZ :  END SUCCESSFULLY - END TIMESTAMP : 2019-07-15-12.55.00
…

JOB: ABC                          DATE : 15/07/2019 12:58:34

*---------------------------------------------------------------*
*                          End job                          *
*---------------------------------------------------------------*

Problem 1:
As you can see we have two lines defined the job ABC and two timestamps global of the whole job. The first one is start time and the second one if end time. Unfortunately, there is no way to tell the beginning and the end because they have the same format. I want the job, the timestamps added to every event extracted from this file.

Problem 2:
The content of the log file is divided by the block named STEP. For each step you have a main program identified by last three characters of the first eight characters at the first line of step and eventually many mini-program, I will talk about this in Problem 3 following.
For example in STEP000
AA100PPP-STOC0001-------------------------------------

PPP from AA100PPP is the main program of this step. It has two local timestamps, beginning and end time. With this information, it creates an event. In a nutshell, an event is defined like this:

{
job: ABC
timestampstartglobal: 15/07/2019 12:15:42
timestampendglobal: 15/07/2019 12:58:34 
process: PPP
timestampstartprocess: 2019-07-15-12.45.00
timestampendprocess:  2019-07-15-12.46.00
}

Problem 3:
In STEP003, the main program XYZ called mini program TY1. TY1 is now responsible to manage all the program in this step. We have here multiple PROGRAM IDENTIFICATION / START PROCESSING – START TIMESTAMP for each program. So for this step, the timestamps of XYZ is all I cared about but this info is chopped into 2 lines and somehow I need to tell a parser to take the start timestamp of XYZ at the line right under PROGRAM IDENTIFICATION
XYZ should take start timestamp from this line:
*CD00TY1-STOC0003-START PROCESSING – START TIMESTAMP : 2019-07-15-12.51.00
not the line of the mini program OPM
*CD00TY1-STOC0003-START PROCESSING – START TIMESTAMP : 2019-07-15-12.53.00

Problem 4:
Also in STEP003, I want to add a nested field to event with information extracted from the mini program OPM
BBCPTOPM-STOC0002-FILE EXTRACTED : 000000000002
In the end, we have an event like the one in Problem 2 with an extra field

{
job: ABC
process: XYZ
...
extraInfo: [
fileExtracted: 2
]
} 

Is there any way to solve all the problems and achieve the parsing for this kind of log? Thanks

Using ruby you can do pretty much anything. For example if you ingest the entire file as a single event using a multiline codec on a file input using a pattern that never matches and a timeout

 codec => multiline { pattern => "^Spalanzani" negate => true what => previous auto_flush_interval => 1

Then with this filter

    ruby {
        code => '
            message = event.get("message")
            overallTimestamps = message.scan(/^JOB: ([A-Za-z0-9]+)\s+DATE : ([0-9\/: ]+)/)
            event.set("overallTimestamps", overallTimestamps)
            stepTimestamps = message.scan(/PROGRAM IDENTIFICATION : ([A-Za-z0-9]+)\/[^
]+
[^
]+START PROCESSING – START TIMESTAMP : ([0-9-.]+)/)
            event.set("stepTimestamps", stepTimestamps)
        '
    }
}

You will get

"overallTimestamps" => [
    [0] [
        [0] "ABC",
        [1] "15/07/2019 12:15:42"
    ],
    [1] [
        [0] "ABC",
        [1] "15/07/2019 12:58:34"
    ]
],
   "stepTimestamps" => [
    [0] [
        [0] "PPP",
        [1] "2019-07-15-12.45.00"
    ],
    [1] [
        [0] "XYZ",
        [1] "2019-07-15-12.51.00"
    ],
    [2] [
        [0] "OPM",
        [1] "2019-07-15-12.53.00"
    ]
],

Note the use of literal newlines in the regexp - one or more not-newline followed by newline - to consume the rest of the line after the step name

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.