Parsing issues

nickel43 · December 7, 2021, 8:00pm

Hello,

I have an issue with parsing my data.

Here is a sample of JSON:

{
    "info": {
        "generated_on": "2017-12-03 08:41:42.057563", 
        "slice": "0-999", 
        "version": "v1"
    }, 
    "playlists": [
        {
            "name": "Rock", 
            "collaborative": "false", 
            "pid": 0, 
            "modified_at": 1493424000, 
            "num_tracks": 22, 
            "num_albums": 27, 
            "num_followers": 1, 
            "tracks": [
                {
                    "pos": 0, 
                    "artist_name": "Michael Jackson", 
                    "track_uri": "spotify:track:0UaMYEvWZi0ZqiDOoHU3YI", 
                    "artist_uri": "spotify:d5F5d7go1WT98tk", 
                    "track_name": "Song", 
                    "album_uri": "spotify:album:6vV5Udzzf4Qo2I9K", 
                    "duration_ms": 226863, 
                    "album_name": "The Cookbook"
                }
            ],
            "num_edits": 34,
            "duration_ms": 9065801,
            "num_artists": 37
        },
        {
            "name": "Jazz", 
            "collaborative": "false", 
            "pid": 0, 
            "modified_at": 1493424000, 
            "num_tracks": 22, 
            "num_albums": 27, 
            "num_followers": 1, 
            "tracks": [
                {
                    "pos": 0, 
                    "artist_name": "Whatever", 
                    "track_uri": "spotify:track:0UaMYEvWZi0ZqiDOoHU3YI", 
                    "artist_uri": "spotify:d5F5d7go1WT98tk", 
                    "track_name": "Song", 
                    "album_uri": "spotify:album:6vV5Udzzf4Qo2I9K", 
                    "duration_ms": 226863, 
                    "album_name": "The Cookbook"
                }
            ],
            "num_edits": 34,
            "duration_ms": 9065801,
            "num_artists": 37
        }
    ]
}

It contains a set of playlists, and each playlist contains tracks with several fields.

My Logstash configuration:

input {
    file {
        path => "/home/data/sample.json"
        sincedb_path => "/dev/null"
        start_position => "beginning"
        codec => multiline {
            pattern => "^Spalanzani"
            negate => true
            what => previous
            auto_flush_interval => 1
        }
    }
}

filter {
    json {
        source => "message"
    }
}


output{
    elasticsearch{
        hosts => "localhost:9200"
        index => "music"
    }
    stdout { }
}

But it is not getting me anywhere.

What I want to see in Kibana is one hit for each track, with all of the fields above (artist name etc.) but ALSO fields that refer to the playlist the track is linked to (playlist.name and playlist.pid).

I also have no interest in the very first part of the file ("info").

I am having a terrible time writing the Logstash filter for this.

Does anybody know what I should do?

Thank you

Badger · December 7, 2021, 8:28pm

The first thing to do would be

split { field => "playlists" }
split { field => "[playlists][tracks]" }

At that point you will have one event for each track, and I think each one will look like this (AWS US East is down this afternoon, or I would test it):

"playlists": {
    "name": "Rock",
    "collaborative": "false",
    "pid": 0,
    "modified_at": 1493424000,
    "num_tracks": 22,
    "num_albums": 27,
    "num_followers": 1,
    "tracks": {
        "pos": 0,
        "artist_name": "Michael Jackson",
        "track_uri": "spotify:track:0UaMYEvWZi0ZqiDOoHU3YI",
        "artist_uri": "spotify:d5F5d7go1WT98tk",
        "track_name": "Song",
        "album_uri": "spotify:album:6vV5Udzzf4Qo2I9K",
        "duration_ms": 226863,
        "album_name": "The Cookbook"
    },
    "num_edits": 34,
    "duration_ms": 9065801,
    "num_artists": 37
},

So then you can

mutate {
    rename => {
        "[playlists][tracks]" => "track"
        "[playlists][name]" => "[playlist][name]"
        "[playlists][pid]" => "[playlist][pid]"
    }
    remove_field => "playlists"
}
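
Putting the pieces together, the filter section might look roughly like the sketch below. Dropping [message] as soon as the JSON parses and removing [info] are additions on my part, on the assumption that neither is needed downstream. split clones the event for each array entry, so a large [message] left in place would be copied every time.

```
filter {
    # Parse the whole document that the multiline codec assembled in [message].
    # remove_field only runs if parsing succeeds, so a failed event keeps
    # [message] for debugging.
    json {
        source => "message"
        remove_field => [ "message" ]
    }

    # Assumption: the top-level "info" object is not wanted in the output.
    mutate { remove_field => [ "info" ] }

    # One event per playlist, then one event per track within each playlist.
    split { field => "playlists" }
    split { field => "[playlists][tracks]" }

    # Keep the track plus the playlist name and pid, then drop the rest.
    mutate {
        rename => {
            "[playlists][tracks]" => "track"
            "[playlists][name]" => "[playlist][name]"
            "[playlists][pid]" => "[playlist][pid]"
        }
        remove_field => [ "playlists" ]
    }
}
```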

If you want the fields inside [track] to be at the top level, you can use a ruby filter to copy them there.
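
A minimal sketch of such a ruby filter, assuming [track] is the hash produced by the rename above and that none of its keys clash with fields you want to keep at the top level:

```
ruby {
    code => '
        track = event.get("track")
        # Only act when the rename produced a hash; otherwise leave the event alone.
        if track.is_a?(Hash)
            # Copy each track field (artist_name, track_name, ...) to the top level.
            track.each { |k, v| event.set(k, v) }
            event.remove("track")
        end
    '
}
```

This would go after the mutate, since it relies on [track] already existing.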

nickel43 · December 7, 2021, 10:20pm

Hello @Badger and thanks for the very fast reply.

I added split as you suggested, and it works with the sample I provided above, but not with my complete files (600,000 lines each, about 30 MiB on average).

I added max_bytes and max_lines to the multiline codec, but split still does not seem to work with files that large.

Have you encountered this before? Any suggestions?

Thank you

Badger · December 7, 2021, 10:48pm

In old versions (before this PR was merged, which I think corresponds to split filter version 3.1.8 and Logstash 7.5), the split filter would iterate over the array, cloning the whole event, array included, and then overwriting the array with a single entry. For very large arrays, cloning the array and immediately deleting it on each iteration results in monstrous memory allocation rates and long GC delays.

Other than that, no.

nickel43 · December 8, 2021, 11:14am

Hello @Badger

Here is the logstash output when using a complete file:

exception=>#<LogStash::Json::ParserError: Unexpected end-of-input: expected close marker for Object (start marker at [Source: (byte[])"{
    "info": {
        "generated_on": "2017-12-03 08:41:42.057563", 
        "slice": "0-999", 
        "version": "v1"
    }, 
    "playlists": [
        {
            "name": "Throwbacks", 
            "collaborative": "false", 
            "pid": 0, 
            "modified_at": 1493424000, 
            "num_tracks": 52, 
            "num_albums": 47, 
            "num_followers": 1, 
            "tracks": [
                {
                    "pos": 0, 
                    "artist_name": ""[truncated 34118866 bytes]; line: 1, column: 1])
 at [Source: (byte[])"{
    "info": {
        "generated_on": "2017-12-03 08:41:42.057563", 
        "slice": "0-999", 
        "version": "v1"
    }, 
    "playlists": [
        {
            "name": "Throwbacks", 
            "collaborative": "false", 
            "pid": 0, 
            "modified_at": 1493424000, 
            "num_tracks": 52, 
            "num_albums": 47, 
            "num_followers": 1, 
            "tracks": [
                {
                    "pos": 0, 
                    "artist_name": ""[truncated 34118866 bytes]; line: 689058, column: 34119372]>}
[2021-12-08T06:01:56,759][WARN ][logstash.filters.split   ][main][2db25766a2e2a20dfe94279bf0ea70bddcaf131311c7db2fe09e225bf5560f53] Only String and Array types are splittable. field:playlists is of type = NilClass
[2021-12-08T06:01:56,760][WARN ][logstash.filters.split   ][main][3b9d9036dd11da5abb9b839b48e7d56d1b62c8c980f3cd831c83aac420f991d9] Only String and Array types are splittable. field:[playlists][tracks] is of type = NilClass
{
      "@version" => "1",
          "tags" => [
        [0] "multiline",
        [1] "_jsonparsefailure",
        [2] "_split_type_failure"
    ],
          "path" => "/home/user/Downloads/data/file.json",
          "host" => "user",
    "@timestamp" => 2021-12-08T11:01:51.655Z
}

My logstash.conf:

input {
    file {
        path => "/home/user/Downloads/data/file.json"
        sincedb_path => "/dev/null"
        start_position => "beginning"
        codec => multiline {
            pattern => "^Spalanzani"
            negate => true
            what => previous
            auto_flush_interval => 1
            max_lines => 10000000000000
            max_bytes => "5000 MiB"
        }
    }
}

filter {
    json { source => "message"}
    split { field => "playlists" }
    split { field => "[playlists][tracks]" }
    mutate {
        remove_field => [ "message" ]
    }

}

output{
    elasticsearch{
        hosts => "localhost:9200"
        index => "music"
    }
    stdout { }
}

This filter works with the example I showed in the first message of this topic. I googled the error, and the results suggest that my data is malformed (missing brackets and such), but I double-checked and the data is correct.

nickel43 · December 8, 2021, 3:12pm

Sorry for the double post; I think I found what was wrong.

I added a line break '\n' at the beginning of the JSON file, and now it works with the same filter I posted in my last message. Any idea why? My data is parsed correctly now.

nickel43 · December 8, 2021, 4:32pm

UPDATE:

Sorry for the third post in a row.

It turns out the line break itself is not what makes it work.

It works because I edit the file and then save it. Whenever Logstash processes a file I have not touched, I get the splitting errors. But as soon as I add a line break and save, then delete that same line break and save again, it works. That is very odd: it looks like I need to save each file at least once. I am writing a script to do this right now, because I have 1000 files. If you know why I need to do that, I would be very interested to know.

system · January 5, 2022, 4:33pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
