Parse nested key-values


#1

Hi,

I am trying to parse Oracle listener.log file. Here is a sample:

29-NOV. -2017 14:34:48 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=localhost)(USER=grid))(COMMAND=status)(ARGUMENTS=64)(SERVICE=LISTENER)(VERSION=203424000)) * status * 0
29-NOV. -2017 14:34:48 * version * 0
29-NOV. -2017 14:34:49 * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.1)(PORT=10484)) * service_register * LsnrAgt * 0
29-NOV. -2017 14:34:52 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=localhost)(USER=grid))(COMMAND=status)(ARGUMENTS=64)(SERVICE=LISTENER)(VERSION=203424000)) * status * 0
29-NOV. -2017 14:35:53 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=localhost)(USER=grid))(COMMAND=status)(ARGUMENTS=64)(SERVICE=LISTENER)(VERSION=203424000)) * status * 0
29-NOV. -2017 14:36:53 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=localhost)(USER=grid))(COMMAND=status)(ARGUMENTS=64)(SERVICE=LISTENER)(VERSION=203424000)) * status * 0
29-NOV. -2017 14:37:48 * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.1)(PORT=10488)) * service_register * +ASM * 0
29-NOV. -2017 14:37:51 * service_update * +ASM * 0

I have successfully extracted the message part using a grok filter, and I now want to parse this (CONNECT_DATA=(CID=(PROGRAM=)(HOST=localhost)(USER=grid))(COMMAND=status)(ARGUMENTS=64)(SERVICE=LISTENER)(VERSION=203424000)).
The target should be

{
  "connect_data": {
    "cid": {
      "program": "",
      "host": "localhost",
      "user": "grid"
    },
    "command": "status",
    "arguments": 64,
    "service": "listener",
    "version": 20342400
  }
}

This looks like a nested key-value, so I tried with this configuration

kv {
  source => "connect_data"
  include_brackets => "false"
  recursive => "true"
  transform_key => "lowercase"
  value_split => "="
  trim_key => "\(\)"
  trim_value => "\(\)"
}

and multiple variants of it (such as include_brackets=false but without trimming, include_brackets=true and with trimming).

Is there a way to parse this format with the kv filter? If not, what are the alternatives?

Thanks for you answers


#2

I would try using mutate+gsub to convert that into JSON, then use a json filter.


#3

I ended up doing what you proposed, and I almost got it working, but it seems that Logstash is behaving weirdly with one of my regex.

As a reminder, here is what grok extracts from a log line:

(CONNECT_DATA=(CID=(PROGRAM=)(HOST=vkav-479-proorderint)(USER=grid))(COMMAND=status)(ARGUMENTS=64)(SERVICE=LISTENER)(VERSION=203424000))

I configured Logstash as such:

filter {
  mutate {
    gsub => [
      "connect_data", "=\)", ':"")', # Add empty values
      "connect_data", "\)\(", ",", # )( are between multiple keys in the same object
      "connect_data", "\(", "{",
      "connect_data", "\)", "}",
      "connect_data", "=", ":",
      "connect_data", '([\w- ]+)', '"\1"', # Add quotes around keys and values
      "connect_data", ':"(\d+(?:\.\d+)?)"', ":\1" # Remove quotes around numbers, to use JSON native number type
    ]
  }

  json {
    source => "connect_data"
    remove_field => "connect_data"
  }
}

I tested this regex on this specific line both in Sublime Text and on regex101.com, and they all replace the line with

{"CONNECT_DATA":{"CID":{"PROGRAM":"","HOST":"vkav-479-proorderint","USER":"grid"},"COMMAND":"status","ARGUMENTS":64,"SERVICE":"LISTENER","VERSION":203424000}}

which is valid JSON.

But when I run my logs through Logstash, I get this

{"CONNECT_DATA":{"CID":{"PROGRAM":"","HOST":"vkav"-"479"-"proorderint","USER":"grid"},"COMMAND":"status","ARGUMENTS":64,"SERVICE":"LISTENER","VERSION":203424000}}

As you can see, there are too many quotes around the host part.

Is my regex wrong in Logstash context then?


#4
"connect_data", '([\w\- ]+)', '"\1"',

You need to escape the -, otherwise it takes this as a range. No idea what range, but that's the issue.


#5

Wow, great catch! After a few hours trying everything ^^'

Only a few errors in the regex to fix, now that they have arisen...


#6

I am now struggling to replace single \ with double \ (to have valid JSON), I don't know how many times I have to escape them in the configuration ^^'

Here is the objective: c:\users\User\appdata\local\programs\python\python36\python.exe => c:\\users\\User\\appdata\\local\\programs\\python\\python36\\python.exe

I have tried

gsub => [
  "connect_data", "\\", "\\\\"
]
gsub => [
  "connect_data", '\\', '\\\\'
]

both don't even start.

This workaround

gsub => [
  "connect_data", "[\\]", "TESTTESTTEST"
]

does find \ correctly, but I tried to replace with '\\\\' and Logstash fails to start.

What should I write in the gsub replace to output 2 backslashes?


#7

Also, Logstash doesn't support text case conversion with regex?
Like this: (\w+) => "\L\1" (https://www.regular-expressions.info/replacecase.html)


#9

Enable 'config.support_escapes: true' in your logstash.yml and then try this

    mutate { gsub => [ "message", "[\\\\]", "\\\\\\\\@@@@" ] }
    mutate { gsub => [ "message", "@@@@", "" ] }

You need the @@@@ junk because otherwise the final \ will escape the closing "


Gsub replacing \\n with \n
#10

It works! Thank you so much for your help!

For the record, here is the configuration I used to parse Oracle listener.log:

    grok {
      id => "parse_oracle_listenerlog"
      match => [ "message", "^%{CUSTOM_DATESTAMP:timestamp} \* %{DATA:connect_data}(\* %{PROTOCOL_INFO:protocol_info})? \* %{DATA:event}( \* %{DATA:sid})? \* %{INT:return_code}$" ]
      pattern_definitions => {
        CUSTOM_DATESTAMP => "%{MONTHDAY}-%{DATA}-%{YEAR} %{TIME}"
        PROTOCOL_INFO => "\(%{DATA}\)"
      }
    }

    translate {
      id => "translate_month_names"
      field => "timestamp"
      dictionary_path => "/etc/logstash/pipelines.d/oracle/months.yml"
      refresh_behaviour => "replace"
      destination => "timestamp"
      exact => false
      override => true
    }

    if [connect_data] =~ /^\(/ {
      mutate {
        id => "convert_connect_data_to_json_string"
        gsub => [
          "connect_data", "=\)", '="")',                         # Add empty values
          "connect_data", "\)\(", ",",                           # )( are between multiple keys in the same object
          "connect_data", "\(", "{",
          "connect_data", "\)", "}",
          "connect_data", '(\w+)=', '"\1"=',                     # Add quotes around keys
          "connect_data", '=([^"{\[].*?)([,}\]])', '="\1"\2',    # Add quotes around values
          "connect_data", "=", ":",
          "connect_data", ':"(\d+(?:\.\d+)?)"', ":\1",           # Remove quotes around numbers, to use JSON native number type
          "connect_data", '[\\\\]', '\\\\\\\\@@@@',              # Replace \ with double \\, the @@@@ junk is there because otherwise the final \ would escape the closing...
          "connect_data", "@@@@", ""                             # Remove the junk we added previously
        ]
      }

      json {
        id => "parse_json_connect_data"
        source => "connect_data"
        remove_field => "connect_data"
      }
    }

    if [protocol_data] {
      mutate {
        id => "convert_protocol_data_to_json_string"
        gsub => [
          "protocol_data", "=\)", '="")',                         # Add empty values
          "protocol_data", "\)\(", ",",                           # )( are between multiple keys in the same object
          "protocol_data", "\(", "{",
          "protocol_data", "\)", "}",
          "protocol_data", '(\w+)=', '"\1"=',                     # Add quotes around keys
          "protocol_data", '=([^"{\[].*?)([,}\]])', '="\1"\2',    # Add quotes around values
          "protocol_data", "=", ":",
          "protocol_data", ':"(\d+(?:\.\d+)?)"', ":\1",           # Remove quotes around numbers, to use JSON native number type
          "protocol_data", '[\\\\]', '\\\\\\\\@@@@',              # Replace \ with double \\, the @@@@ junk is there because otherwise the final \ would escape the closing...
          "protocol_data", "@@@@", ""                             # Remove the junk we added previously
        ]
      }

      json {
        id => "parse_json_protocol_data"
        source => "protocol_data"
        remove_field => "protocol_data"
      }
    }


    # replace the timestamp by the one coming from the listener.log
    date {
      match => [ "timestamp" , "dd-MMM-yyyy HH:mm:ss" ]
      target => "@timestamp"
      remove_field => [ "timestamp" ]
    }
  }

with a months.yml that looks like this (because for some reason our DB logs in French, which also explains the weird grok format for the date...)

"JANV.": Jan
"FÉVR.": Feb
"MARS ": Mar
"AVR. ": Apr
"MAI  ": May
"JUIN ": Jun
"JUIL.": Jul
"AOÛT ": Aug
"SEPT.": Sep
"OCT. ": Oct
"NOV. ": Nov
"DÉC. ": Dec

Should I accept your last answer or this recap?


(system) #11

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.