How to get Grok Filter to see Newline and Carriage Returns?


(jeremiah adams) #1

I originally posted this on Stack Overflow, but I think I am unlikely to get an answer there, so I am posting here.

I am trying to parse our log files and send them to Elasticsearch. The problem is that our S3 client injects lines into the file that contain carriage returns ('\r') instead of newline characters ('\n'). The File input is configured with '\n' as the delimiter, which is consistent with 99% of the data. When I run Logstash against this data, it misses the last line, which is the one I am really looking for, because the File input treats the '\r' characters as ordinary text rather than line breaks. To get around this I am using a Mutate filter to rewrite the '\r' characters to '\n'. The mutate works, but Grok still sees the result as one big line and tags it with _grokparsefailure.

I expect to toss out the lines containing the '\r' garbage and only parse the lines that look like a normal log4j entry. The problem is that the key line I need is munged in with that garbage, and the Mutate filter does not cause the new '\n' characters to be re-evaluated as event boundaries.

Config

input {
    file {
        path => "/home/pa_stg/runs/2015-12-09-cron-1449666001/run.log"
        start_position => "beginning"
        sincedb_path => "/data/logstash/sincedb"
        stat_interval => 300
        type => "spark"
    }
}
filter {
    mutate {
        # the replacement string is a literal newline
        gsub => ["message", "\r", "
"]
    }
    grok {
        match => {"message" => "\A%{DATE:date} %{TIME:time} %{LOGLEVEL:loglevel} %{SYSLOGPROG}%{GREEDYDATA:data}"}
        break_on_match => false
    }
}
output{
    stdout { codec => rubydebug }
}

## Input
This sample from the input file illustrates the problem. The ^M characters are how vim displays the '\r' carriage returns ('more' hides most of them). I left the line as-is so you can see that the whole thing is treated by Linux and the File plugin as a single line of text. I am trimming this input due to the forum's size limits.

^M[Stage 79:=======>                                               (30 + 8) / 208]^M[Stage 79:============>                                          (49 + 8) / 208]^M[Stage 79:=================>                                     (65 + 8) / 208]^M[Stage 93:================================================>     (186 + 6) / 208]^M[Stage 93:=====================================================>(206 + 2) / 208]^M                                                                                ^M15/12/09 13:03:46 INFO SomethingProcessor$: Something Processor completed
15/12/09 13:04:44 INFO CassandraConnector: Disconnected from Cassandra cluster: int

## Output
Apologies for the formatting, but it is butchered in the output as well. The key point is that "message" should only contain the "15/12/09 13:03:46 INFO SomethingProcessor$: Something Processor completed" line. I am trimming most of the output due to the forum's size limits.

{
   "message" => "\n[Stage 79:=======>                                               (30 + 8) / 208]\n[Stage 79:============>
                         (49 + 8) / 208]\n[Stage 79:=================>                                     (65 + 8) / 208]\n[Stage 93:=====================================================>(206 + 2) / 208]\n
                                                             \n15/12/09 13:03:46 INFO SomethingProcessor$: Something Processor com
pleted",
        "@version" => "1",
        "@timestamp" => "2015-12-09T22:16:52.898Z",
        "host" => "ip-10-252-1-225",
        "path" => "/home/something/pa_stg/runs/2015-12-09-cron-1449666001/run.log",
        "type" => "spark",
        "tags" => [
        [0] "_grokparsefailure"
    ]
}

(Craig Schotke) #2

Have you tried using the split filter? https://www.elastic.co/guide/en/logstash/current/plugins-filters-split.html

I think you should be able to split the event on '\r', and then each resulting line would be re-processed as a separate event.
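A minimal sketch of that idea (the terminator value is an assumption; on Logstash versions that do not process escape sequences in config strings, you may need to paste a literal carriage return between the quotes instead of "\r"):

filter {
    split {
        # splits the single multi-line event into one event per segment;
        # "message" is the split filter's default field
        field      => "message"
        terminator => "\r"
    }
}

Each new event then flows through the rest of the filter chain, so a grok that matches a normal log4j line should succeed on the segment you care about, and you can drop the progress-bar segments that fail to match.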


(system) #3