Trying to parse Hadoop logs with Logstash

Hello,

I have a Hadoop cluster (Hortonworks) and I am trying to send the hdfs/yarn/... logs to Logstash to parse them. The logs are shipped by Filebeat.

My logs look like this (I think it is the log4j format):

2016-06-12 03:00:20,432 INFO  ipc.Client (Client.java:handleConnectionFailure(869)) - Retrying connect to server: d1hdpslave01.ouest-france.fr/128.1.228.46:8020. Already tried 25 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2016-06-12 03:00:20,660 ERROR datanode.DataNode (DataXceiver.java:run(278)) - d1hdpslave03.ouest-france.fr:50010:DataXceiver error processing unknown operation  src: /128.1.228.49:51734 dst: /128.1.228.49:50010
java.io.EOFException
        at java.io.DataInputStream.readShort(DataInputStream.java:315)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.readOp(Receiver.java:58)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:227)
        at java.lang.Thread.run(Thread.java:745)

Here is my Logstash config at the moment:

filter {
  grok {
    match => [ "message", "(?m)%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:severity} {1,2}%{JAVACLASS:class} \(%{USERNAME:java_family}\.%{USERNAME:error_type}\:%{USERNAME:java_sub_family}\(%{INT:java_num}\)\) \- %{GREEDYDATA:message}" ]
    overwrite => [ "message" ]
  }
  mutate {
    remove_field => [ "[beat]", "input_type", "offset" ]
  }
}
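
For reference, the Java stack trace lines are joined into a single event on the Filebeat side (which is why the grok pattern starts with (?m)). A minimal sketch of that multiline setup, assuming the Filebeat 1.x prospector syntax of the time (the path and type are illustrative):

filebeat:
  prospectors:
    -
      paths:
        - /var/log/hadoop/hdfs/*.log
      document_type: hdfs_datanode
      multiline:
        # Any line that does NOT start with a timestamp is appended
        # to the previous event (i.e. stack trace continuation lines)
        pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2}'
        negate: true
        match: after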

Has anyone succeeded in parsing this kind of log completely, or at least the Java error message in a bit more detail?

Thanks,

This should work. What do you get? Please show an example event produced by a stdout { codec => rubydebug } output.

Here is what I get:

2016-06-30T07:02:24.329Z dxxdpslavexx dXXdpslave0X.ouest-france.fr:50010:DataXceiver error processing unknown operation  src: /128.1.22x.xx:49066 dst: /128.1.22x.xx:50010
java.io.EOFException
        at java.io.DataInputStream.readShort(DataInputStream.java:315)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.readOp(Receiver.java:58)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:227)
        at java.lang.Thread.run(Thread.java:745)
{
            "message" => "dxxdpslavxxx.ouest-france.fr:50010:DataXceiver error processing unknown operation  src: /128.1.22x.xx:43851 dst: /128.1.22x.xx:50010\njava.io.EOFException\n\tat java.io.DataInputStream.readShort(DataInputStream.java:315)\n\tat org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.readOp(Receiver.java:58)\n\tat org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:227)\n\tat java.lang.Thread.run(Thread.java:745)",
           "@version" => "1",
         "@timestamp" => "2016-06-30T07:02:32.405Z",
              "count" => 1,
             "source" => "/var/log/hadoop/hdfs/hadoop-hdfs-datanode-d1hdpslave02.log",
               "type" => "hdfs_datanode",
             "fields" => nil,
               "host" => "dxxdpslavexx",
               "tags" => [
        [0] "beats_input_codec_plain_applied"
    ],
          "timestamp" => "2016-06-30 09:02:23,269",
           "severity" => "ERROR",
              "class" => "datanode.DataNode",
        "java_family" => "DataXceiver",
         "error_type" => "java",
    "java_sub_family" => "run",
           "java_num" => "278"
}

Is there a way to parse the "message" part a little bit further?

Sure, just continue the grok expression in the same manner as you started it. I'm not sure I understand the difficulty. How, exactly, would you like to have the remainder of the message parsed?
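
For instance, to break the DataXceiver message from your example into fields, you could chain a second grok after the first one. This is only a sketch for that one message shape; the field names are my own suggestions, not anything standard:

filter {
  grok {
    match => [ "message", "%{HOSTNAME:datanode_host}:%{INT:datanode_port}:DataXceiver error processing %{WORD:operation} operation\s+src: /%{IP:src_ip}:%{INT:src_port} dst: /%{IP:dst_ip}:%{INT:dst_port}" ]
    # Use a dedicated tag so a non-DataXceiver message doesn't
    # pollute the standard _grokparsefailure tag
    tag_on_failure => [ "_dataxceiver_nomatch" ]
  }
}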

The problem is that the messages can be quite different:

2016-06-30T07:52:08.822Z d1hdpslave01 Get corrupt file blocks returned error: Operation category READ is not supported in state standby
{
            "message" => "Get corrupt file blocks returned error: Operation category READ is not supported in state standby",
           "@version" => "1",
         "@timestamp" => "2016-06-30T07:52:08.822Z",
             "source" => "/var/log/hadoop/hdfs/hadoop-hdfs-namenode-d1hdpslave01.log",
               "type" => "hdfs_namenode",
              "count" => 1,
             "fields" => nil,
               "host" => "d1hdpslave01",
               "tags" => [
        [0] "beats_input_codec_plain_applied"
    ],
          "timestamp" => "2016-06-30 09:52:08,774",
           "severity" => "WARN",
              "class" => "namenode.FSNamesystem",
        "java_family" => "FSNamesystem",
         "error_type" => "java",
    "java_sub_family" => "getCorruptFiles",
           "java_num" => "7324"
}

I don't know the full possibilities of Logstash. Maybe I can't go any further.

A single grok filter can try to match the message against multiple expressions (see example in the grok filter documentation), so you don't have to write a single expression that matches every string imaginable.
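
For example, covering the two message variants seen in this thread might look like the sketch below (field names are illustrative; grok tries the patterns in order and stops at the first match, so put the most specific ones first):

filter {
  grok {
    match => {
      "message" => [
        "%{HOSTNAME:datanode_host}:%{INT:datanode_port}:DataXceiver error processing %{WORD:operation} operation\s+src: /%{IP:src_ip}:%{INT:src_port} dst: /%{IP:dst_ip}:%{INT:dst_port}",
        "Get corrupt file blocks returned error: %{GREEDYDATA:error_reason}"
      ]
    }
  }
}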