Specific grok filter for multi-line PostgreSQL log

Hi All,

I am having a hard time trying to merge multi-line PostgreSQL logs into one Logstash event with grok.

My log file looks like this:

Jul 22 17:03:27 my.host example.com[24977]: [137-1] 2016-07-22 17:03:27.339 MSK  User: username  Database: my_db  Host: 192.168.0.52(38494)  Proc ID: 24977  etc1
Jul 22 17:03:27 my.host example.com[24977]: [137-2] 2016-07-22 17:03:27.339 MSK  User: username  Database: my_db  Host: 192.168.0.52(38494)  Proc ID: 24977  etc2
Jul 22 17:03:27 my.host example.com[24597]: [2953-1] 2016-07-22 17:03:27.339 MSK  User: username  Database: my_db  Host: 192.168.0.52(38053)  Proc ID: 24597  etc
Jul 22 17:03:27 my.host example.com[3637]: [3779-1] 2016-07-22 17:03:27.340 MSK  User: username  Database: my_db  Host: 192.168.0.52(17809)  Proc ID: 3637  etc
Jul 22 17:03:27 my.host example.com[24977]: [138-1] 2016-07-22 17:03:27.339 MSK  User: username  Database: my_db  Host: 192.168.0.52(38494)  Proc ID: 24977  etc1
Jul 22 17:03:27 my.host example.com[3637]: [3780-1] 2016-07-22 17:03:27.340 MSK  User: username  Database: my_db  Host: 192.168.0.52(17809)  Proc ID: 3637  etc
Jul 22 17:03:27 my.host example.com[24977]: [138-2] 2016-07-22 17:03:27.339 MSK  User: username  Database: my_db  Host: 192.168.0.52(38494)  Proc ID: 24977  etc2 
Jul 22 17:03:27 my.host example.com[24977]: [139-1] 2016-07-22 17:03:27.340 MSK  User: username  Database: my_db  Host: 192.168.0.52(38494)  Proc ID: 24977  etc
Jul 22 17:03:27 my.host example.com[24597]: [2954-1] 2016-07-22 17:03:27.340 MSK  User: username  Database: my_db  Host: 192.168.0.52(38053)  Proc ID: 24597  etc

I tried to use:

multiline {
  pattern => "... [\d+-1]" # adjust pattern to be more exact
  negate => true
  what => "previous"
}

but I get:

line 1: ...[137-1] and [137-2]...
line 2: ...[2953-1]...
line 3: ...[3779-1]...
line 4: ...[138-1]...
line 5: ...[3780-1] and [138-2]...
line 6: ...[139-1]...
line 7: ...[2954-1]...

That's not what I need. I want:

line 1: ...[137-1] and [137-2]...
line 2: ...[2953-1]...
line 3: ...[3779-1]...
line 4: ...[138-1] and [138-2]...
line 5: ...[3780-1]...
line 6: ...[139-1]...
line 7: ...[2954-1]...

or

line 1: ...[137-1] and [137-2]...
line 2: ...[2953-1]...
line 3: ...[3779-1]...
line 4: ...[3780-1]...    
line 5: ...[138-1] and [138-2]...
line 6: ...[139-1]...
line 7: ...[2954-1]...

I would appreciate it if someone could share any hints or their experience with this.

Thanks

Please give me an idea of how to solve this.

You can't reliably use the multiline codec for this since log lines could be interleaved. Perhaps the aggregate filter can help?
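
As an aside, the multiline pattern above also wouldn't match as written: unescaped [ and ] form a regex character class rather than literal brackets. Here's a minimal sketch of the escaped version, which would still be unreliable here precisely because lines from different process IDs interleave:

    multiline {
      # \[ and \] match literal brackets; "[\d+-1]" without backslashes is a character class
      pattern => "\[\d+-1\]"
      negate => true
      what => "previous"
    }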

Thanks. I read about this plugin and I think it's the right one, but I don't know Ruby well enough to write the code section for the aggregate plugin. Can you help me?

Sorry, I'm not that well-versed with aggregate myself and I don't have time to dig around.

Can anyone help me write the code option for the aggregate plugin?

Hi, the aggregate plugin is the right plugin for your need, because your logs are interleaved.

That said, for one id, can you give an example of what you expect as the result document?

I expect the following:

event 1: ...[137-1] and [137-2]...
event 2: ...[2953-1]...
event 3: ...[3779-1]...
event 4: ...[138-1] and [138-2]...
event 5: ...[3780-1]...
event 6: ...[139-1]...
event 7: ...[2954-1] and [2954-2]...

The task_id in this case could be the PID [24977] or the line number 137.

A friend helped me with the Ruby code, and I put together this configuration:

filter {
  grok {
    match => [ "message", "%{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:logsource} %{SYSLOGPROG}: [%{INT:line}-%{INT:part_of_line}] %{GREEDYDATA:ostatok}" ]
  }

  aggregate {
    task_id => "%{line}"
    code => "
      map['full_message'] ||= ''
      map['full_message'] += event['ostatok']
      event['full_message'] = map['full_message']
    "
  }
}

But I get the following events in Elasticsearch:

event 1: ...[137-1]...
event 2: ...[137-1] and [137-2]...
event 3: ...[2953-1]...
event 4: ...[3779-1]...
event 5: ...[138-1]...
event 6: ...[138-1] and [138-2]...
event 7: ...[3780-1]...
event 8: ...[139-1]...
event 9: ...[2954-1]...
event 10: ...[2954-1] and [2954-2]...

How can I get rid of the partial events, i.e. event 1 ([137-1]), event 5 ([138-1]), and event 9 ([2954-1])?

@111126

OK, given your Logstash configuration, I now understand your need.
Your main problem is that you have no explicit "end line" marker to flush the aggregate map.
But you're lucky! Last week, the aggregate plugin was released with new options to deal with exactly that case!

So, here's the right configuration for your need:

    grok {
        match => [ "message", "%{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:logsource} %{SYSLOGPROG}: \[%{INT:line}-%{INT:part_of_line}\] %{GREEDYDATA:ostatok}" ]
    }

    aggregate {
        task_id => "%{line}"
        code =>  "
            map.merge!(event) if map.empty?      # keep all fields from the first line of the group
            map['full_message'] ||= ''
            map['full_message'] += event['ostatok']
        "
        timeout => 10                            # no explicit end marker, so flush after 10s of inactivity
        push_map_as_event_on_timeout => true     # emit the aggregated map as a new event when the timeout fires
        timeout_code => "event.tag('aggregated')"
    }

    # keep only the aggregated events; drop the original per-line events
    if "aggregated" not in [tags] {
        drop {}
    }

Note that in your grok expression, the [ and ] characters must be escaped.
As for aggregate, the main idea is that since you have no explicit "end log line", we use a 10-second timeout to push the aggregated map into the pipeline as a new Logstash event.
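
To illustrate, here is roughly what the event pushed on timeout for line group 137 would look like, assuming the sample log above (field names come from the grok pattern; values are abridged, and the exact set of metadata fields depends on your Logstash version):

    {
        "logsource"    => "my.host",
        "pid"          => "24977",
        "line"         => "137",
        "full_message" => "2016-07-22 17:03:27.339 MSK  User: username ... etc12016-07-22 17:03:27.339 MSK ... etc2",
        "tags"         => ["aggregated"]
    }

Note that the parts are concatenated with no separator ("etc1" runs straight into the next part's timestamp); if that matters, appending a space or newline between parts in the code block would keep them readable.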


Thank you a lot, it works for me!
However, after implementing your settings, I ran into something that wasn't obvious before: my log is very big and is forwarded from many servers, so line numbers collide. Example:

Aug 11 11:34:53 my.host1 example.com1[21872]: [198-1] 2016-08-11 11:34:53.029 MSK etc
Aug 11 11:34:53 my.host1 example.com1[21878]: [198-1] 2016-08-11 11:34:53.150 MSK etc
Aug 11 11:34:53 my.host1 example.com1[21879]: [198-1] 2016-08-11 11:34:53.515 MSK etc
Aug 11 11:34:53 my.host3 example.com3[16548]: [17198-1] 2016-08-11 11:34:53.529 MSK etc
Aug 11 11:34:53 my.host2 example.com2[19241]: [198-1] 2016-08-11 11:34:53.722 MSK etc
Aug 11 11:34:53 my.host2 example.com2[19017]: [198-1] 2016-08-11 11:34:53.873 MSK etc
Aug 11 11:34:54 my.host1 example.com1[21901]: [198-1] 2016-08-11 11:34:54.091 MSK etc

The date, time, line numbers, and process PIDs are real; the rest of the data has been replaced with abstract values.
If task_id = line, the lines above are merged into a single event, which is wrong.
If I use task_id = pid, the result is even worse, because a single PID handles many lines with different numbers.

Can I use a double task_id (line and pid)? How?

Yes, you can!
It's easy:

task_id => "%{pid}%{line}"
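
One small caveat: since %{pid} and %{line} are both plain numbers, concatenating them with no separator can in theory produce ambiguous ids (pid 21 with line 785 and pid 217 with line 85 both give "21785"). A delimiter rules that out:

    # a delimiter keeps the pid/line pairs unambiguous, e.g. "24977-137"
    task_id => "%{pid}-%{line}"

And if PIDs can coincide across your servers, adding %{logsource} to the task_id as well would prevent cross-host collisions.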

Oeeeah! Many thanks.
You've been a really big help. Great!

Nice to see that I helped you!
You're welcome!