Parsing HTML inside a JSON field

Hi,

** a longer explanation of the problem is in the second response to @Badger **

I need to parse a log line with JSON that contains a field holding an HTML document, e.g.:

2023-03-04 20:20:06,817 [http-nio-8080-exec-6] WARN com.pilaty.controller.order.pilatyController - {"IP":"some_ip",
"URL":"some_url",
"METHODE":"POST",
"INPUT": "some_pyload", 
"OUTPUT": "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html......</html>"
}

If there is no HTML in the OUTPUT field, the log is parsed correctly; otherwise the JSON parse fails. My filter:

  grok {
    match => {
      "message" => "%{TIMESTAMP_ISO8601:timestamp}\s+%{DATA:thread}\s+%{DATA:loglevel}\s+%{DATA:class}\s+-\s+%{GREEDYDATA:jsondata}"
    }
  }


  json {
    source => "jsondata"
    target => "parsed_json"
  }

Any help is appreciated, thanks.

I would suggest

    grok { match => { "message" => "%{TIMESTAMP_ISO8601:timestamp}\s+\[%{DATA:thread}\]\s+%{LOGLEVEL:loglevel}\s+%{JAVACLASS:class}\s+-\s+%{GREEDYDATA:[@metadata][restOfLine]}" } }
    if [message] =~ /"OUTPUT"/ {
        grok { match => { "[@metadata][restOfLine]" => '"OUTPUT": "%{GREEDYDATA:[@metadata][xml]}"\n' } }
        mutate { gsub => [ "[@metadata][restOfLine]", ', \n"OUTPUT": "[^\n]*"\n', "" ] }
    }
    json { source => "[@metadata][restOfLine]" target => "parsed_json" }
    xml { source => "[@metadata][xml]" target => "theXML" force_array => false }
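
To make the idea behind that filter easier to follow, here is a rough plain-Ruby sketch of the same steps: pull the OUTPUT value out of the text, drop that line from the JSON, then parse what is left. The sample text and the `split_output` helper are made up for illustration and are not part of Logstash; the regular expression assumes OUTPUT sits on its own line, as in the first sample above.

```ruby
require "json"

# Hypothetical sample modelled on the first log in this thread. The HTML value
# contains unescaped double quotes, so the text as a whole is not valid JSON.
REST_OF_LINE = <<~LOG
  {"IP":"some_ip",
  "URL":"some_url",
  "METHODE":"POST",
  "INPUT": "some_payload",
  "OUTPUT": "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"><html><body>oops</body></html>"
  }
LOG

# Assumption: the OUTPUT field occupies exactly one line and is the last field.
OUTPUT_LINE = /,\s*\n"OUTPUT": "([^\n]*)"\n/

# Hypothetical helper: returns [json_without_output, html_or_nil].
def split_output(text)
  html = text[OUTPUT_LINE, 1]
  return [text, nil] if html.nil?

  [text.sub(OUTPUT_LINE) { "\n" }, html]
end

# Parsing the raw text fails because of the bare quotes inside the HTML.
begin
  JSON.parse(REST_OF_LINE)
rescue JSON::ParserError
  puts "raw text is not valid JSON"
end

json_text, html = split_output(REST_OF_LINE)
parsed = JSON.parse(json_text)

puts parsed.keys.inspect
puts html
```

The HTML ends up in its own variable, much like `[@metadata][xml]` in the filter, so it can be handled separately from the JSON fields.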

Thanks for your response. Maybe I did not explain it well, so let me describe the problem more clearly.

This is the original log line as sent by Filebeat:

"2023-03-06 00:28:12,966 \u001b[32m[http-nio-8080-exec-2]\u001b[0;39m \u001b[34mINFO\u001b[0;39m com.Slity.logging.LoggingControllerAspect - {\"ip\":\"x.x.x.196\",\"route\":\"/some_path/check\",\"methode\":\"GET\",\"canal\":\"Android\",\"uuid\":\"a55332ab49c8\",\"os\":\"Android\",\"locale\":\"tz\",\"appVersion\":\"xxx\",\"osOrigin\":\"13\",\"message\":\"[SLITY-LOG]\",\"uri\":\"/some_path/check\",\"statusCode\":\"500 INTERNAL_SERVER_ERROR\",\"headers\":\"{\\\"canal\\\":\\\"\\\",\\\"uuid\\\":\\\"\\\",\\\"user\\\":\\\"\\\",\\\"payload\\\":\\\"eSomedataSomedata==\\\",\\\"authorization\\\":\\\"SometokenSometoken\\\",\\\"CatToken\\\":\\\"\\\"}\",\"input\":\"\\\"\\\\{Cat=\\\\\\\"97345620323\\\\\\\"}\\\"\",\"output\":\"<!DOCTYPE html PUBLIC \\\"-//W3C//DTD XHTML 1.0 Strict//EN\\\" \\\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\\\"><html xmlns=\\\"http://www.w3.org/1999/xhtml\\\"><head><title>SLITY3 Micro  5.2021.5 #bfisher- Error report</title><style type=\\\"text/css\\\"><!--H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;} H3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;} BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;}  B {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;} P{font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A {color : black;}HR {color : #525D76;}--></style> </head><body><h1>HTTP Status 404 - </h1><hr/><p><b>type</b> Status report </p><p><b>message </b></p><p><b>description </b>The requested resource is not available.</p><hr/><h3>SLITY3 Micro  5.2021.5 #bfisher</h3></body></html>\\\"======= >  404 Not Found: \\\"<!DOCTYPE html PUBLIC \\\"-//W3C//DTD XHTML 1.0 Strict//EN\\\" \\\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\\\"><html 
xmlns=\\\"http://www.w3.org/1999/xhtml\\\"><head><title>SLITY3 Micro  5.2021.5 #bfischer- Error report</title><style type=\\\"text/css\\\"><!--H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} H3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;} BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;}  {color : #525D76;}--></style> </head><body><h1>HTTP Status 404 - </h1></b>The requested resource is not available.</p><hr/><h3>SLITY3  #bofisher</h3></body></html>\",\"serviceName\":\"MSSLITY\"}"

I need to extract all fields from this message, so I decided to split it into two parts: the first part is a simple pattern and the second part is JSON. The second part is what causes problems, for two reasons:

  • Backslash escape characters have been added. I managed to get rid of them with a mutate/gsub filter.

  • The JSON contains a field with HTML that itself contains double quotes, which breaks parsing of the whole JSON. I am trying to extract this HTML field, replace all the double quotes that cause the problem, and inject it back into the JSON before parsing it with the json/ruby filters.
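
The two points above may be related. Assuming the backslashes in the sample are ordinary JSON string escapes, the small Ruby sketch below shows that a payload with escaped quotes parses as it is, and that stripping every backslash (which is what a gsub on `[\\]` does) is what leaves bare quotes inside the HTML. The payload is a shortened, hypothetical stand-in for the real message.

```ruby
require "json"

# Hypothetical, shortened payload. Single quotes keep the backslashes literal,
# so the string contains \" exactly as a JSON serializer would emit it.
escaped = '{"statusCode":"500","output":"<html xmlns=\"http://www.w3.org/1999/xhtml\"></html>"}'

# With the escapes intact the text is valid JSON and the HTML comes back whole.
parsed = JSON.parse(escaped)
puts parsed["output"]

# Remove every backslash, mimicking the first gsub rule in the filter.
stripped = escaped.gsub("\\", "")

begin
  JSON.parse(stripped)
  stripped_ok = true
rescue JSON::ParserError
  # The quotes around the xmlns value now terminate the JSON string early.
  stripped_ok = false
end

puts "parses after stripping backslashes? #{stripped_ok}"
```

If that assumption holds for the real message, it may be worth trying the json filter on the text before any backslashes are removed.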

So my filter looks like this:

filter {
  # your other filters here

  mutate {
    gsub => [
      "message", "[\\]", "",
      "message", "\"{", "{",
      "message", "}\"", "}",
      "message", "Cat=\"(\+?\d{8,14})\"", "Cat='\1'",
      #"message", "<!DOCTYPE html.*<\/html>","+++++++++++++++++++++++++++++++" # remove this breaks the JSON parsing
    ]
  }


  grok {
    match => {
      "message" => "%{TIMESTAMP_ISO8601:timestamp}\s+%{DATA:thread}\s+%{DATA:loglevel}\s+%{DATA:class}\s+-\s+%{GREEDYDATA:jsondata}"
    }
  }


  json {
    source => "jsondata"
    target => "parsed_json"
  }

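  ruby {
    code => '
      parsed = event.get("parsed_json")
      # parsed_json is absent when the json filter fails, so only copy from a hash
      if parsed.is_a?(Hash)
        parsed.each { |k, v|
          event.set(k, v) unless v.nil?
        }
      end
      event.remove("parsed_json")
    '
  }

  mutate {
    remove_field => ["jsondata", "parsed_json"]
  }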

}
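
One more thing that may matter: the sample line carries ANSI color codes (`\u001b[32m` and so on) around the thread and log level, and `%{DATA:thread}` would capture them together with the thread name. The plain-Ruby sketch below shows one way to strip them before splitting the prefix from the JSON and promoting the JSON keys to top-level fields. The shortened log line, the regular expressions, and the `event` hash are illustrative assumptions, not Logstash APIs.

```ruby
require "json"

# Matches ANSI color sequences such as ESC[32m or ESC[0;39m.
ANSI_COLOR = /\e\[[0-9;]*m/

# Prefix layout assumed from the sample: timestamp, [thread], level, class, " - ", JSON.
PREFIX = /\A(?<timestamp>\S+ \S+)\s+\[(?<thread>[^\]]+)\]\s+(?<loglevel>\w+)\s+(?<logger>\S+)\s+-\s+(?<jsondata>\{.*\})\z/m

# Hypothetical, shortened version of the log line from this thread.
line = "2023-03-06 00:28:12,966 \e[32m[http-nio-8080-exec-2]\e[0;39m " \
       "\e[34mINFO\e[0;39m com.Slity.logging.LoggingControllerAspect - " \
       "{\"ip\":\"x.x.x.196\",\"methode\":\"GET\",\"canal\":null}"

clean = line.gsub(ANSI_COLOR, "")
match = PREFIX.match(clean)
raise "line does not match the assumed layout" if match.nil?

# Stand-in for the Logstash event: start with the prefix fields.
event = {
  "timestamp" => match[:timestamp],
  "thread"    => match[:thread],
  "loglevel"  => match[:loglevel],
  "class"     => match[:logger]
}

# Promote each non-nil JSON key to the top level, as the ruby filter does.
JSON.parse(match[:jsondata]).each { |k, v| event[k] = v unless v.nil? }

puts event.inspect
```

In Logstash itself the equivalent first step would be a mutate/gsub on the escape sequences placed before the grok filter.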

I hope this explains my struggle better. The ruby code is from your answers on this forum, so you have helped me twice :wink:

Regards.
