Looking for a way to "detect json"


(Claudio Kuenzler) #1

On my Docker servers I use the GELF log plugin to write the logs in GELF format to Logstash.
On the Logstash side I prepared the following listener:

input {
  gelf {
    host => "0.0.0.0"
    port => 5000
    type => "docker"
  }
}

The messages a container writes to stdout are forwarded by the Docker daemon to Logstash's gelf listener. Sometimes the stdout logs of a container can look like this:

10.42.23.37 - - [29/Sep/2017 13:11:55] "GET / HTTP/1.1" 200 -

Sometimes, depending on the application, the logs sent to stdout are in json format, for example:

{"pid":10,"hostname":"b4cf4da50843","level":20,"time":1506690729316,"msg":"Successfully loaded schema 'LDDesign'","v":1}

Of course, when they arrive in JSON format, I'd like to split up the message into fields (pid, hostname, level, time, msg, v).
This does work if I apply the following filter:

filter {
    if [type] == "docker" {   
          json { source => "message" }
    }
}

However, this now tries to JSON-parse every message, which produces a lot of errors on non-JSON messages like this:

[2017-09-29T15:16:55,899][WARN ][logstash.filters.json ] Error parsing json {:source=>"message", :raw=>"10.42.159.41 - - [29/Sep/2017 13:16:55] "GET / HTTP/1.1" 200 -\r", :exception=>#<LogStash::Json::ParserError: Unexpected character ('.' (code 46)): Expected space separating root-level values

I came across Unable to avoid JSON parsing errors in logstash log-file, which suggests a conditional check "to see if the message field looks like json". But the suggested if condition doesn't seem to correctly detect JSON in my case. I tried:

filter {
    if [type] == "docker" {   
      if [message] =~ "\A\{.+\}\z" {
            json { source => "message" }
      }
   }
}

But now nothing is detected as JSON and the JSON messages are not split into fields.
Is there a general and known way to achieve this, i.e. a regular expression to detect JSON?
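For reference, one suspicion: the :raw value in the WARN log above ends in \r, and \z anchors at the very end of the string, so a trailing carriage return would defeat the match. A quick sketch in Python (which uses \Z where Ruby uses \z; the messages here are hypothetical):

```python
import re

# Python's \Z behaves like Ruby's \z: it anchors at the very end of the string.
pattern = re.compile(r"\A\{.+\}\Z")

clean   = '{"pid":10,"msg":"ok"}'
with_cr = '{"pid":10,"msg":"ok"}\r'  # Docker/GELF messages may carry a trailing \r

print(bool(pattern.search(clean)))    # True
print(bool(pattern.search(with_cr)))  # False: the trailing \r defeats the \z anchor
```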


(Magnus Bäck) #2

Your configuration works fine for me:

$ cat test.config 
input { stdin { } }
output { stdout { codec => rubydebug } }
filter {
  if [message] =~ "\A\{.+\}\z" {
    json {
      source => "message"
    }
  }
}
$ ( echo '{"message": "this is JSON"}' ; echo 'this is not JSON' ) | /opt/logstash/bin/logstash -f test.config
Settings: Default pipeline workers: 8
Pipeline main started
{
       "message" => "this is JSON",
      "@version" => "1",
    "@timestamp" => "2017-10-02T05:49:10.406Z",
          "host" => "lnxolofon"
}
{
       "message" => "this is not JSON",
      "@version" => "1",
    "@timestamp" => "2017-10-02T05:49:10.633Z",
          "host" => "lnxolofon"
}
Pipeline main has been shutdown
stopping pipeline {:id=>"main"}

(Claudio Kuenzler) #3

Thanks for your verification, Magnus.
It works now. The only thing I changed was from:

filter {
    if [type] == "docker" {

        if [message] =~ "\A\{.+\}\z" {
          json { source => "message"  }
        }

        # Some non-json messages are now tagged with _jsonparsefailure. We don't care.
        #mutate { remove_tag => [ "_jsonparsefailure" ] }
    }
}

to:

filter {
    if [type] == "docker" {

        if [message] =~ "\A\{.+\}\z" {
          json {
            source => "message"
          }
        }

        # Some non-json messages are now tagged with _jsonparsefailure. We don't care.
        #mutate { remove_tag => [ "_jsonparsefailure" ] }
    }
}

I don't think the line breaks are the cause for this, though. Strange. Anyway, it works. Thanks again for the confirmation.

For the sake of completeness:

Non-JSON message:

# echo '{ "message": "Server is shutting down. It ran for 71s.\r" }' | /usr/share/logstash/bin/logstash -f /tmp/json-test.conf 
13:34:23.841 [[.monitoring-logstash]-pipeline-manager] INFO  logstash.pipeline - Starting pipeline {"id"=>".monitoring-logstash", "pipeline.workers"=>1, "pipeline.batch.size"=>2, "pipeline.batch.delay"=>5, "pipeline.max_inflight"=>2}
13:34:23.846 [[.monitoring-logstash]-pipeline-manager] INFO  logstash.pipeline - Pipeline .monitoring-logstash started
13:34:23.858 [[main]-pipeline-manager] INFO  logstash.pipeline - Starting pipeline {"id"=>"main", "pipeline.workers"=>4, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>5, "pipeline.max_inflight"=>500}
13:34:23.871 [[main]-pipeline-manager] INFO  logstash.pipeline - Pipeline main started
13:34:23.922 [Api Webserver] INFO  logstash.agent - Successfully started Logstash API endpoint {:port=>9601}
{
    "@timestamp" => 2017-10-02T11:34:23.888Z,
      "@version" => "1",
          "host" => "inf-elk01-t",
       "message" => "Server is shutting down. It ran for 71s.\r"
}

JSON message:

# echo '{ "message": "{\"pid\":8,\"hostname\":\"2fa0a81cebef\",\"level\":40,\"time\":1506929233374,\"msg\":\"Requiring lib/events is deprecated, use liServer.events instead\",\"v\":1}\r" }' | /usr/share/logstash/bin/logstash -f /tmp/json-test.conf 
13:34:42.345 [[.monitoring-logstash]-pipeline-manager] INFO  logstash.pipeline - Starting pipeline {"id"=>".monitoring-logstash", "pipeline.workers"=>1, "pipeline.batch.size"=>2, "pipeline.batch.delay"=>5, "pipeline.max_inflight"=>2}
13:34:42.352 [[.monitoring-logstash]-pipeline-manager] INFO  logstash.pipeline - Pipeline .monitoring-logstash started
13:34:42.370 [[main]-pipeline-manager] INFO  logstash.pipeline - Starting pipeline {"id"=>"main", "pipeline.workers"=>4, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>5, "pipeline.max_inflight"=>500}
13:34:42.385 [[main]-pipeline-manager] INFO  logstash.pipeline - Pipeline main started
13:34:42.450 [Api Webserver] INFO  logstash.agent - Successfully started Logstash API endpoint {:port=>9601}
{
    "@timestamp" => 2017-10-02T11:34:42.408Z,
      "@version" => "1",
          "host" => "inf-elk01-t",
       "message" => "{\"pid\":8,\"hostname\":\"2fa0a81cebef\",\"level\":40,\"time\":1506929233374,\"msg\":\"Requiring lib/events is deprecated, use liServer.events instead\",\"v\":1}\r"
}

(Claudio Kuenzler) #5

Hello Magnus

After a lot of troubleshooting and trying all kinds of different regexes, it looks to me like the json filter doesn't work correctly with nested JSON input.

Here again my config:

 filter {
    if [type] == "docker" {
        # Remove colors from stdout logging
        mutate { gsub => [ "message", "\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[m|K]", "" ] }

        if [message] =~ "^\{.*\}[\s\S]*$" {
          json {
            source => "message"
          }

          mutate {
            add_field => { "Claudio" => "I was detected as json" }
          }
        }

        # Some non-json messages are now tagged with _jsonparsefailure. We don't care.
        mutate { remove_tag => [ "_jsonparsefailure" ] }
    }
}
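The gsub pattern above can be sanity-checked outside Logstash. A minimal Python sketch with the same regex (note that inside a character class, [m|K] also matches a literal |, so [mK] would be slightly tighter; the colored sample string is made up):

```python
import re

# Same pattern as the gsub above; strips ANSI color/erase escape sequences.
ansi = re.compile(r"\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[m|K]")

colored = "\x1b[32mINFO\x1b[0m server started"
print(ansi.sub("", colored))  # INFO server started
```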

For example this message is correctly identified as json and runs through the json filter:

{\"pid\":183,\"hostname\":\"6971762d376d\",\"level\":30,\"time\":1508157072618,\"msg\":\"elasticsearch-client: Document with id '1321878' has been inserted into 'https://user:pass@elasticsearch.example.com:9243/prod_article_document/document/1321878'\",\"v\":1}

However, a message like this never got into Elasticsearch:

{\"time\":\"2017-10-16T12:28:36.511Z\",\"tags\":[\"activity\",\"application\",\"prod\"],\"action\":\"document.update\",\"data\":{\"user_id\":61,\"article_id\":1322104}}\r

But as soon as I disabled the json part, the message appeared in ES!
So I disabled the json filter in the config:

    if [message] =~ "^\{.*\}[\s\S]*$" {
      #json {
      #  source => "message"
      #}

      mutate {
        add_field => { "Claudio" => "I was detected as json" }
      }
    }

Entry showed up in Kibana/ES:

{
  "_index": "docker-2017.10.16",
  "_type": "docker",
  "_id": "AV8lPcmyyUlRzb_QJl0N",
  "_version": 1,
  "_score": null,
  "_source": {
    "source_host": "52.50.181.45",
    "level": 6,
    "created": "2017-10-02T13:13:09.845813431Z",
    "message": "{\"time\":\"2017-10-16T12:52:07.081Z\",\"tags\":[\"activity\",\"application\",\"prod\"],\"action\":\"document.update\",\"data\":{\"user_id\":226,\"article_id\":1321878}}\r",
    "type": "docker",
    "version": "1.1",
    "command": "/.r/r /sbin/dumb-init bash -c deploy/travis/pre_run_tasks.sh && node index.js",
    "image_name": "images/application-api:5.14.3",
    "protocol": 0,
    "@timestamp": "2017-10-16T12:52:07.081Z",
    "container_name": "r-Application-Api-Application-Api-2-01c74d8b",
    "Claudio": "I was detected as json",
    "host": "onl-radoaie02-p",
    "@version": "1",
    "tag": "docker: r-Application-Api-Application-Api-2-01c74d8b",
    "image_id": "sha256:b9818fdd8155b95054a8712037146d73f62285b1733c7582f920fe786869208b",
    "facility": "gelf-rb",
    "container_id": "ce4425258eeb8056d685833d57bf3322469ed84bf067e64def81234cf09ec05e"
  },
  "fields": {
    "created": [
      1506949989845
    ],
    "@timestamp": [
      1508158327081
    ]
  },
  "highlight": {
    "container_name": [
      "@kibana-highlighted-field@r@/kibana-highlighted-field@-@kibana-highlighted-field@Livingdocs@/kibana-highlighted-field@-@kibana-highlighted-field@Api@/kibana-highlighted-field@-@kibana-highlighted-field@Livingdocs@/kibana-highlighted-field@-@kibana-highlighted-field@Api@/kibana-highlighted-field@-2-01c74d8b"
    ],
    "message": [
      "{\"time\":\"2017-10-16T12:52:07.081Z\",\"tags\":[\"@kibana-highlighted-field@activity@/kibana-highlighted-field@\",\"livingdocs\",\"prod\"],\"action\":\"document.update\",\"data\":{\"user_id\":226,\"article_id\":1321878}}\r"
    ]
  },
  "sort": [
    1508158327081
  ]
}

You can see the special field "Claudio" was added; this means the regex matched this message. So the regex works for both messages, yet only the first message was split into fields by the json filter; the second message, with nested JSON, was not.

To me it looks like the json filter either crashes (though I can't see any parsing problems in the Logstash logs) or can't handle nested JSON, and the message ends up in nirvana, but definitely not in ES.

Can you or anyone confirm this behaviour? Is this a bug?


(Magnus Bäck) #6

Is there anything in the log about ES rejecting the event? A mapping conflict could cause such behavior.


(Claudio Kuenzler) #7

Actually I was following the Logstash plain log the whole time and nothing (no errors or warnings like 'failed to parse json') came up. But I just saw that in the Elasticsearch logfile, a lot of entries like this were written:

[2017-10-16T08:59:55,563][DEBUG][o.e.a.b.TransportShardBulkAction] [inf-elkesi01-p] [docker-2017.10.16][1] failed to execute bulk item (index) BulkShardRequest [[docker-2017.10.16][1]] containing [index {[docker-2017.10.16][docker][AV8j-1gqyUlRzb_Q4VUM], source[{"source_host":"52.50.181.45","data":{"user_id":18},"level":6,"created":"2017-10-12T11:31:32.361634528Z","message":"{\"time\":\"2017-10-16T06:59:55.497Z\",\"tags\":[\"activity\",\"application\",\"staging\"],\"action\":\"user.login.success\",\"data\":{\"user_id\":18}}\r","type":"docker","version":"1.1","command":"/.r/r /sbin/dumb-init bash -c deploy/travis/pre_run_tasks.sh && node index.js","tags":["activity","application","staging"],"image_name":"images/application-api:5.16.3","protocol":0,"@timestamp":"2017-10-16T06:59:55.496Z","container_name":"r-applicationApi-application-Api-3-17a320ce","host":"onl-radoaie03-s","@version":"1","action":"user.login.success","tag":"docker: r-applicationApi-application-Api-3-17a320ce","time":"2017-10-16T06:59:55.497Z","image_id":"sha256:cf3401d9078f13d9927bc006887080528e7f587b95823255f52e30969b7a341a","facility":"gelf-rb","container_id":"f3d57914dfaec2612b48918b27d36537ac4ee5676210ed1d0a0bc04b2c03617c"}]}]
org.elasticsearch.index.mapper.MapperParsingException: failed to parse [time]
[...]
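This would fit the MapperParsingException on [time]: the flat messages carry time as epoch milliseconds (a number), while the nested ones carry it as an ISO 8601 string, so whichever type ES maps first makes the other fail. A small sketch of the type mismatch (values taken from the messages above):

```python
import json

flat   = json.loads('{"time":1506690729316,"msg":"ok"}')
nested = json.loads('{"time":"2017-10-16T06:59:55.497Z","action":"user.login.success"}')

# ES maps a field from the first document it sees; a long mapping for "time"
# then rejects the ISO 8601 string (and vice versa).
print(type(flat["time"]).__name__)    # int
print(type(nested["time"]).__name__)  # str
```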

This morning it's another story. The nested json entries appear in ES and are searchable in Kibana (once I refreshed the docker-* index). Hooray!

I assume that yesterday there was a mapping conflict caused by my changed regex detecting json (and nested json): the newly extracted fields clashed with the index's existing mappings. As today is a new day, there's a new index with fresh mappings. Sometimes patience (or deleting yesterday's index) solves it.


(system) #8

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.