Logstash parsing unicode characters

Hi, we are running into parsing errors when sending data through Logstash pipelines. After digging a little, we found that the source system has been sending a lot of Unicode special characters, something like the below:

\u0006\u0000\u0000\u0000\u0000\u0000\u0000\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000\b�������\u0006\u0000\u0000\u0000\u0000\u0000\u0000\u0000\r\u0006\u0000\u0000\u0000\u0000\u0000\u0000\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000\a�������\u0006\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0013\u0006\u0000\u0000\u0000\u0000\u0000\u0000\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0006�������\u0006\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0019\u0006\u0000\u0000\u0000\u0000\u0000\u0000\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0005�������\u0006\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u001F\u0006\u0000\u0000\u0000\u0000\u0000\u0000\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0004�������\u0006\u0000\u0000\u0000\u0000\u0000\u0000\u0000%\u0006\u0000\u0000\u0000\u0000\u0000\u0000\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0003�������\u0006\

Is there a Logstash grok plugin available that can remove those special Unicode characters?

Thanks!

I could really use some guidance with the above issue. Any suggestions please?

Have you tried the mutate gsub filter:

    filter {
      mutate {
        gsub => [
          "fieldname", "[\u0000\u0001]", ""
        ]
      }
    }

?

Thanks Tomo. Yes, I have used the gsub filter quite often in the past. But in this particular case, these Unicode sequences can vary, so I was looking for something generic to use. The escape could be \u0000 or have any other character/number after "\u". It would be highly laborious to go over each Unicode character with the gsub filter (and I'm guessing it wouldn't be effective either).

Are there really so many special characters?
How do you distinguish them from useful characters?

That's where I was looking to get some help. 🙂

I was hoping there is some readily available filter to tackle those unicode characters.

How about:

    filter {
      mutate {
        gsub => [
          "fieldname", "[\u0000-\u001F]", ""
        ]
      }
    }

?

There is no ignore option on the plain codec plugin, so it seems you have to specify the characters by regex. I suppose there are not so many special characters generated by your system.
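If listing code points one by one gets laborious, note that gsub patterns are Ruby regular expressions, which support Unicode character properties. A generic sketch (fieldname is a placeholder for your field; this is an alternative I'm suggesting, not something from the thread above):

    filter {
      mutate {
        gsub => [
          # \p{Cntrl} matches any Unicode control character
          # (U+0000-U+001F and U+007F-U+009F) in one pattern
          "fieldname", "\p{Cntrl}", ""
        ]
      }
    }

This avoids enumerating each \uXXXX escape, at the cost of also removing legitimate control characters such as tabs and newlines if your field contains them.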


No luck yet. Still getting errors parsing the message:

 at [Source: (byte[])"{"id":"79f9856b-ba63-41fd-9665-b8484086cb5c","message":"Claimed ....

Can you share the full error you are receiving?

Is that Unicode string part of your message, or is your entire message this way?

Is every message like this, or do just some random messages give you this error? Can you share an example of a message that is giving you the error?

What is your Logstash pipeline?

Here is a sample test message (sensitive info removed):

{"app":"test-api[v1]","hostname":"test-api-blue04-dc1-zn1","port":10610,"@timestamp":"2022-02-10T10:56:29.205-06:00","logLevel":"INFO","threadName":"Default Executor-thread-2453","loggerName":"com.rest.util.RestLoggingHelper","message":"restType=RESPONSE|requestUri=/test-api/credit-card/aclaims|responseCode=200|responseStatus=OK|requestBody=[{\"channel\":\"BBB\",\"criteria\":[{\"token\":\"XXYXYXYXYXY\",\"referenceDate\":\"2022-01-06\",\"amount\":1031.22,\"authorizationProperties\":{\"customerNumber\":\"1515151515\",\"orderNumber\":\"151515151\"}}]}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000]|responseBody=[{\"id\":\"79f9856b-ba63-41fd-9665-b8484086cb5c\",\"message\":\"Claimed 1 authorization out of 1 unclaimed matches, 0 matches were already claimed.\",\"claimedAuthorization\":{\"id\":\"79f9856b-2233-41fd-9665-b8484086cb5c\",\"token\":\"222233334444\",\"expirationMMYY\":\"0000\",\"referenceDate\":\"2022-01-06\",\"amount\":1031.22,\"currencyCode\":\"USD\",\"nameOnCard\":\"test user\",\"telephoneNumber\":\"21212121212\"}

Note that there are a bunch of spaces right in front of \u0000\u0000, which get trimmed here once I paste the message.

Logstash config (filter section):

    filter {
      json {
        source => "message"
      }

      if "beats_input_codec_plain_applied" in [tags] {
        mutate {
          remove_tag => "beats_input_codec_plain_applied"
        }
      }

      mutate {
        gsub => [
          "message", "[\\]u0000", ""
        ]
      }

      # ... <rest of the log processing here>
    }

Error message:

[2022-02-15T13:18:48,537][WARN ][logstash.filters.json    ][pipeline] Error parsing json {:source=>"message", :raw=>"{\"id\":\"79f9856bxyxyxyx-b8484086cb5c\",\"message\":\"Claimed 1 authorization out of 1 unclaimed matches, 0 matches were already claimed.\",\"claimedAuthorization\":{\"id\":\"79f9856bxyxyxyx41fd-9665-b8484086cb5c\",\"token\":\"xyxyxyx\",\"expirationMMYY\":\"0000\",\"referenceDate\":\"2022-01-06\",\"amount\":1031.22,\"currencyCode\":\"USD\",\"nameOnCard\":\"test user\",\"telephoneNumber\":\"12121212121\",\"address1\":\"PO BOX 121212121\",\"city\":\"test\",\"stateOrProvinceCode\":\"CA\",\"postalCode\":\"123121\",\"countryCode\":\"US\",\"channel\":\"WEB\",\"authorizationProperties\":{\"orderNumber\":\"12112121\",\"webCartId\":\"\",\"webSessionId\":\"cdfaf57a-121212121a2bf-2d576c4e9980\",\"webIP\":\"121.121.12.12\",\"invoiceNumber\":\"121212121\",\"customerNumber\":\"1212121\"},\"verified\":true,\"authorized\":true,\"authorizationResponseCode\":\"100\",\"authorizationResponseDescription\":\"APPROVED\",\"authorizationCode\":\"121212\",\"addressVerificationResponseCode\":\"I4\",\"level3\":\"N\",\"commercialCreditCard\":\"Y\",\"overrideModeOfPayment\":\"MC\",\"transactionId\":null,\"cvsResponse\":null,\"cvvMatchResponseCode\":null,\"createdTimestamp\":\"2022-02-10T10:55:53.115-06:00[America/Chicago\\\",\"claimed\":true,\"claimedTimestamp\":\"2022-02-10T10:56:29.198-06:00[America/Chicago\\\",\"claimedRequestId\":\"12121212121-1212121-4c5f-94d\",\"merchantOrderNumber\":\"1212121\"}}]", :exception=>#<LogStash::Json::ParserError: Unexpected character ('c' (code 99)): was expecting comma to separate Object entries
 at [Source: (byte[])"{"id":"79f9856b-ba63-41fd-9665-b8484086cb5c","message":"Claimed 1 authorization out of 1 unclaimed matches, 0 matches were already claimed.","claimedAuthorization":{"id":"79f9856b-ba63-12121-9665-121212121","token":"1212121211","expirationMMYY":"0000","referenceDate":"2022-01-06","amount":1031.22,"currencyCode":"USD","nameOnCard":"test user","telephoneNumber":"121212121","address1":"PO BOX 121212121","city":"test","stateOrProvinceCode":"CA","postalCode":"12121","countryCode":"US","[truncated 780 bytes]; line: 1, column: 1107]>}

Try the gsub before the json filter.
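Roughly like this, reusing the gsub pattern from your config, just with the cleanup moved ahead of the parse (a sketch; the rest of your filter section would follow as before):

    filter {
      # strip the literal \u0000 escape sequences before parsing
      mutate {
        gsub => [
          "message", "[\\]u0000", ""
        ]
      }

      json {
        source => "message"
      }
    }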

"createdTimestamp":"2022-02-10T10:55:53.115-06:00[America/Chicago\","claimed":true,

You have a backslash at the end of createdTimestamp. That escapes the quote, so the value of createdTimestamp is 2022-02-10T10:55:53.115-06:00[America/Chicago\", and the parser blows up when it tries to interpret claimed. That is what causes

Unexpected character ('c' (code 99)): was expecting comma to separate Object entries
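If the source really does truncate the closing ] of the zone suffix into a backslash, one possible workaround is to repair it with gsub before the json filter runs. This is just a sketch: it assumes the damage always looks exactly like America/Chicago\ and that the field is called message.

    filter {
      mutate {
        gsub => [
          # turn the stray trailing backslash back into the missing ]
          # (single quotes avoid having to escape double quotes in the pattern)
          "message", 'America/Chicago\\', 'America/Chicago]'
        ]
      }
    }

Fixing the source system so it emits valid JSON would obviously be the better long-term answer.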


I tried removing the backslash with gsub before the json filter - still no luck!

When I tested with this message, which has the backslash at the end of createdTimestamp, I see it working. But when I used the original values, it blew up.

{"app":"test-api[v1]","hostname":"test-api-blue04","port":10610,"@timestamp":"2022-02-16T13:56:29.205-06:00","logLevel":"INFO","threadName":"Default Executor-thread-2453","loggerName":"com.ha.rest.RestLoggingHelper","message":"restType=RESPONSE|requestUri=/test-api/cc/auth|responseCode=200|responseStatus=OK|requestBody=[{\"channel\":\"CCC\",\"criteria\":[{\"token\":\"XXXXXXXXXX\",\"referenceDate\":\"2022-01-06\",\"amount\":0000.22,\"authorizationProperties\":{\"customerNumber\":\"1111111\",\"orderNumber\":\"211212121\"}}]}        \u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000]|responseBody=[{\"id\":\"121212121-12121-12121-b8484086cb5c\",\"message\":\"Claimed 1 authorization out of 1 unclaimed matches, 0 matches were already claimed.\",\"claimedAuthorization\":{\"id\":\"121212121-12121-12121-b8484086cb5c\",\"token\":\"121212121121212\",\"expirationMMYY\":\"0000\",\"referenceDate\":\"2022-01-06\",\"amount\":0000.22,\"currencyCode\":\"USD\",\"nameOnCard\":\"test t. 
user\",\"telephoneNumber\":\"121212121212\",\"address1\":\"PO BOX 1110000\",\"city\":\"test\",\"stateOrProvinceCode\":\"CA\",\"postalCode\":\"00000\",\"countryCode\":\"US\",\"channel\":\"CCC\",\"authorizationProperties\":{\"orderNumber\":\"1212121212\",\"webCartId\":\"\",\"webSessionId\":\"cdfaf57a-853b-431e-a2bf-12121212121\",\"webIP\":\"000.000.10.00\",\"invoiceNumber\":\"12121212\",\"customerNumber\":\"12121212\"},\"verified\":true,\"authorized\":true,\"authorizationResponseCode\":\"100\",\"authorizationResponseDescription\":\"APPROVED\",\"authorizationCode\":\"1212121\",\"addressVerificationResponseCode\":\"I4\",\"level3\":\"N\",\"commercialCreditCard\":\"Y\",\"overrideModeOfPayment\":\"MC\",\"transactionId\":null,\"cvsResponse\":null,\"cvvMatchResponseCode\":null,\"createdTimestamp\":\"2022-02-13T10:55:53.115-06:00[America/Chicago]\",\"claimed\":true,\"claimedTimestamp\":\"2022-02-13T10:56:29.198-06:00[America/Chicago]\",\"claimedRequestId\":\"121212121-d8ae-4c5f-94d9-986393bd1406\",\"merchantOrderNumber\":\"12121212121\"}}]","originatorApp":"test-api","correlation-id":"121212121212121","originatorContext":"/credit-card/authorization-claims","userName":"test","request-id":"9Q8V1n5WRmmt-Xpj4EmlXw","userId":""}

This tells me the parser is skipping the Unicode characters, so it must be failing on some other field. Still scratching my head over which field it is - I just wish Logstash showed exactly where it is erroring.

When I send the original message, it appears to error right here, but I can't figure out why.

exception=>#<LogStash::Json::ParserError: Unexpected character ('c' (code 99)): was expecting comma to separate Object entries
 at [Source: (byte[])"{"id":"79f9856b-ba63-41fd-9665-b8484086cb5c","message":"Claimed 1 authorization out of 1 unclaimed matches, 0 matches were already claimed.","claimedAuthorization":{"id":"79f9856b-ba63-41fd-9665-b8484086cb5c","token":"111111QXZEFC1111","expirationMMYY":"0000","referenceDate":"2022-00-00","amount":1031.22,"currencyCode":"USD","nameOnCard":"TEST T TEST","telephoneNumber":"0000000000","address1":"PO BOX 0000","city":"TEST","stateOrProvinceCode":"CA","postalCode":"00000","countryCode":"US","[truncated 780 bytes]; line: 1, column: 1107]>}

Hoping someone gets a chance to look at my last update and has any other suggestions!

You have not shown us the message that it is failing on. I cannot speculate on what the error might be. Ideally, show the field you are trying to parse from the output of

    output { stdout { codec => rubydebug } }

Sorry, I had to scrub all the confidential information before posting here. But this is what I see in the Logstash log once I enabled the stdout output section:

[2022-02-23T14:05:48,918][INFO ][logstash.javapipeline    ][filebeat-logstash1] Pipeline started {"pipeline.id"=>"filebeat-logstash1"}
[2022-02-23T14:05:49,056][INFO ][logstash.agent           ] Pipelines running {:count=>1, :running_pipelines=>[:"filebeat-logstash1"], :non_running_pipelines=>[]}
[2022-02-23T14:05:49,059][INFO ][filewatch.observingtail  ][filebeat-logstash1] START, creating Discoverer, Watch with file and sincedb collections
[2022-02-23T14:05:49,997][WARN ][logstash.filters.json    ][filebeat-logstash1] Error parsing json {:source=>"message", :raw=>"{\"id\":\"1f9f9f9f9-ba63-41fd-9665-b8484086cb5c\",\"message\":\"Claimed 1 authorization out of 1 unclaimed matches, 0 matches were already claimed.\",\"claimedAuthorization\":{\"id\":\"1f9f9f9f9-ba63-41fd-9665-b8484086cb5c\",\"token\":\"5111114QXZEFC1222\",\"expirationMMYY\":\"1111\",\"referenceDate\":\"2022-01-06\",\"amount\":1031.22,\"currencyCode\":\"USD\",\"nameOnCard\":\"User T Test\",\"telephoneNumber\":\"1111111111\",\"address1\":\"PO BOX 1111\",\"city\":\"CITY\",\"stateOrProvinceCode\":\"CA\",\"postalCode\":\"11111\",\"countryCode\":\"US\",\"channel\":\"WEB\",\"authorizationProperties\":{\"orderNumber\":\"64556572\",\"webCartId\":\"\",\"webSessionId\":\"cdfaf57a-853b-431e-a2bf-2d576c4e9980\",\"webIP\":\"111.111.11.11\",\"invoiceNumber\":\"111111111\",\"customerNumber\":\"1111111\"},\"verified\":true,\"authorized\":true,\"authorizationResponseCode\":\"100\",\"authorizationResponseDescription\":\"APPROVED\",\"authorizationCode\":\"038394\",\"addressVerificationResponseCode\":\"I4\",\"level3\":\"N\",\"commercialCreditCard\":\"Y\",\"overrideModeOfPayment\":\"MC\",\"transactionId\":null,\"cvsResponse\":null,\"cvvMatchResponseCode\":null,\"createdTimestamp\":\"2022-02-10T10:55:53.115-06:00[America/Chicago\\\",\"claimed\":true,\"claimedTimestamp\":\"2022-02-10T10:56:29.198-06:00[America/Chicago\\\",\"claimedRequestId\":\"33a98688-d8ae-4c5f-94d9-986393bd1406\",\"merchantOrderNumber\":\"64556572\"}}]", :exception=>#<LogStash::Json::ParserError: Unexpected character ('c' (code 99)): was expecting comma to separate Object entries
 at [Source: (byte[])"{"id":"1f9f9f9f9-ba63-41fd-9665-b8484086cb5c","message":"Claimed 1 authorization out of 1 unclaimed matches, 0 matches were already claimed.","claimedAuthorization":{"id":"1f9f9f9f9-ba63-41fd-9665-b8484086cb5c","token":"5111114QXZEFC1222","expirationMMYY":"1111","referenceDate":"2022-01-06","amount":1031.22,"currencyCode":"USD","nameOnCard":"User T Test","telephoneNumber":"1111111111","address1":"PO BOX 1111","city":"CITY","stateOrProvinceCode":"CA","postalCode":"11111","countryCode":"US","[truncated 780 bytes]; line: 1, column: 1107]>}
{
             "restType" => "RESPONSE",
             "userName" => "APIUSER",
         "responseCode" => 200,
           "loggerName" => "com.comp.ha.rest.util.RestLoggingHelper",
                 "tags" => [
        [0] "file-based",
        [1] "_jsonparsefailure"
    ],
             "@version" => "1",
          "logFilePath" => "%{[log][file][path]}",
             "logLevel" => "info",
                 "port" => 10610,
       "correlation-id" => "A4844003AB991A7AA9AB0004AC1C0C7C",
                 "zone" => "backoffice",
           "@timestamp" => 2022-02-23T19:56:29.205Z,
              "message" => "RESPONSE BODY : %{parsedccclaimrespjson}",
             "hostname" => "%{[host][name]}",
           "request-id" => "9Q8V1n5WRmmt-Xpj4EmlXw",
        "originatorApp" => "test-api",
               "userId" => "",
                  "jvm" => "test-api-blue04-dc1-zn1",
           "threadName" => "Default Executor-thread-2453",
                 "path" => "/pathfilebeat/config/test.log",
                "appApp" => "test-api[v1]",
       "responseStatus" => "OK",
                 "type" => "log",
    "originatorContext" => "/credit-card/authorization-claims"
}
[2022-02-23T14:05:51,576][DEBUG][logstash.outputs.elasticsearch][filebeat-logstash1] Sending final bulk request for batch. {:action_count=>1, :payload_size=>835, :content_length=>835, :batch_offset=>0}

When I sent the working message with all test data in it, here is the output:

[2022-02-23T14:18:34,182][WARN ][logstash.filters.json    ][logstash-filebeat1] Parsed JSON object/hash requires a target configuration option {:source=>"message", :raw=>""}
{
       "@version" => "1",
       "hostname" => "%{[host][name]}",
           "type" => "log",
           "tags" => [
        [0] "_jsonparsefailure",
        [1] "file-based"
    ],
        "message" => "",
           "zone" => "backoffice",
    "logFilePath" => "%{[log][file][path]}",
           "path" => "path/logstashpipline/config/test2.log",
     "@timestamp" => 2022-02-23T20:18:34.038Z
}
{
        "originatorApp" => "test-api",
    "originatorContext" => "/credit-card/authorization-claims",
             "userName" => "test",
           "loggerName" => "com.ha.rest.RestLoggingHelper",
             "logLevel" => "info",
             "@version" => "1",
               "userId" => "",
             "hostname" => "%{[host][name]}",
                 "type" => "log",
              "message" => "restType=RESPONSE|requestUri=/test-api/cc/auth|responseCode=200|responseStatus=OK|requestBody=[{\"channel\":\"CCC\",\"criteria\":[{\"token\":\"XXXXXXXXXX\",\"referenceDate\":\"2022-01-06\",\"amount\":0000.22,\"authorizationProperties\":{\"customerNumber\":\"1111111\",\"orderNumber\":\"211212121\"}}]}        \u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000]|responseBody=[{\"id\":\"121212121-12121-12121-b8484086cb5c\",\"message\":\"Claimed 1 authorization out of 1 unclaimed matches, 0 matches were already claimed.\",\"claimedAuthorization\":{\"id\":\"121212121-12121-12121-b8484086cb5c\",\"token\":\"121212121121212\",\"expirationMMYY\":\"0000\",\"referenceDate\":\"2022-01-06\",\"amount\":0000.22,\"currencyCode\":\"USD\",\"nameOnCard\":\"test t. user\",\"telephoneNumber\":\"121212121212\",\"address1\":\"PO BOX 1110000\",\"city\":\"test\",\"stateOrProvinceCode\":\"CA\",\"postalCode\":\"00000\",\"countryCode\":\"US\",\"channel\":\"CCC\",\"authorizationProperties\":{\"orderNumber\":\"1212121212\",\"webCartId\":\"\",\"webSessionId\":\"cdfaf57a-853b-431e-a2bf-12121212121\",\"webIP\":\"000.000.10.00\",\"invoiceNumber\":\"12121212\",\"customerNumber\":\"12121212\"},\"verified\":true,\"authorized\":true,\"authorizationResponseCode\":\"100\",\"authorizationResponseDescription\":\"APPROVED\",\"authorizationCode\":\"1212121\",\"addressVerificationResponseCode\":\"I4\",\"level3\":\"N\",\"commercialCreditCard\":\"Y\",\"overrideModeOfPayment\":\"MC\",\"transactionId\":null,\"cvsResponse\":null,\"cvvMatchResponseCode\":null,\"createdTimestamp\":\"2022-02-13T10:55:53.115-06:00[America/Chicago]\",\"claimed\":true,\"claimedTimestamp\":\"2022-02-13T10:56:29.198-06:00[America/Chicago]\",\"claimedRequestId\":\"121212121-d8ae-4c5f-94d9-986393bd1406\",\"merchantOrderNumber\":\"12121212121\"}}]",
          "logFilePath" => "%{[log][file][path]}",
           "threadName" => "Default Executor-thread-2453",
                 "port" => 10610,
                  "jvm" => "test-api-blue04",
                  "app" => "test-api[v1]",
                 "tags" => [
        [0] "file-based"
    ],
                 "zone" => "backoffice",
           "request-id" => "9Q8V1n5WRmmt-Xpj4EmlXw",
                 "path" => "path/logstashpipline/config/test2.log",
           "@timestamp" => 2022-02-23T19:56:29.205Z,
       "correlation-id" => "121212121212121"
}

The event contents after the message has been parsed do not help me, and the error message truncates the JSON. Can you post the log entry that Logstash is ingesting?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.