Parsing issues

Hello,

I have an issue with parsing my data.

Here is a sample of JSON:

{
    "info": {
        "generated_on": "2017-12-03 08:41:42.057563",
        "slice": "0-999",
        "version": "v1"
    },
    "playlists": [
        {
            "name": "Rock",
            "collaborative": "false",
            "pid": 0,
            "modified_at": 1493424000,
            "num_tracks": 22,
            "num_albums": 27,
            "num_followers": 1,
            "tracks": [
                {
                    "pos": 0,
                    "artist_name": "Michael Jackson",
                    "track_uri": "spotify:track:0UaMYEvWZi0ZqiDOoHU3YI",
                    "artist_uri": "spotify:d5F5d7go1WT98tk",
                    "track_name": "Song",
                    "album_uri": "spotify:album:6vV5Udzzf4Qo2I9K",
                    "duration_ms": 226863,
                    "album_name": "The Cookbook"
                }
            ],
            "num_edits": 34,
            "duration_ms": 9065801,
            "num_artists": 37
        },
        {
            "name": "Jazz",
            "collaborative": "false",
            "pid": 0,
            "modified_at": 1493424000,
            "num_tracks": 22,
            "num_albums": 27,
            "num_followers": 1,
            "tracks": [
                {
                    "pos": 0,
                    "artist_name": "Whatever",
                    "track_uri": "spotify:track:0UaMYEvWZi0ZqiDOoHU3YI",
                    "artist_uri": "spotify:d5F5d7go1WT98tk",
                    "track_name": "Song",
                    "album_uri": "spotify:album:6vV5Udzzf4Qo2I9K",
                    "duration_ms": 226863,
                    "album_name": "The Cookbook"
                }
            ],
            "num_edits": 34,
            "duration_ms": 9065801,
            "num_artists": 37
        }
    ]
}

It contains a set of playlists, and each playlist contains tracks with several fields.

My Logstash configuration:

input {
    file {
        path => "/home/data/sample.json"
        sincedb_path => "/dev/null"
        start_position => "beginning"
        codec => multiline {
            # a pattern that never matches, so the whole file is combined into one event
            pattern => "^Spalanzani"
            negate => true
            what => previous
            auto_flush_interval => 1
        }
    }
}

filter {
    json {
        source => "message"
    }
}


output{
    elasticsearch{
        hosts => "localhost:9200"
        index => "music"
    }
    stdout { }
}


But it is not getting me anywhere.

What I want the output to be in Kibana is one hit for each track, with all of the fields above (artist name, etc.) but ALSO a field that refers to the playlist the track is linked to (playlist.name and playlist.pid).
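
For example, for the Michael Jackson track above, each document might look roughly like this (just an illustration of the structure I am after, not actual output):

{
    "artist_name": "Michael Jackson",
    "track_name": "Song",
    "track_uri": "spotify:track:0UaMYEvWZi0ZqiDOoHU3YI",
    "album_name": "The Cookbook",
    "duration_ms": 226863,
    "playlist": {
        "name": "Rock",
        "pid": 0
    }
}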

I am also not interested in the very first part of the file ("info").

I am having a terrible time writing the Logstash filter for this.

Does anybody know what I should do?

Thank you

The first thing to do would be

split { field => "playlists" }
split { field => "[playlists][tracks]" }

At that point you will have one event for each track, and I think each one will look like this (AWS US East is down this afternoon, or I would test it):

"playlists":  {
        "name": "Rock", 
        "collaborative": "false", 
        "pid": 0, 
        "modified_at": 1493424000, 
        "num_tracks": 22, 
        "num_albums": 27, 
        "num_followers": 1, 
        "tracks":  {
                "pos": 0, 
                "artist_name": "Michael Jackson", 
                "track_uri": "spotify:track:0UaMYEvWZi0ZqiDOoHU3YI", 
                "artist_uri": "spotify:d5F5d7go1WT98tk", 
                "track_name": "Song", 
                "album_uri": "spotify:album:6vV5Udzzf4Qo2I9K", 
                "duration_ms": 226863, 
                "album_name": "The Cookbook"
            }, 
            "num_edits": 34, 
            "duration_ms": 9065801, 
            "num_artists": 37
        },

So then you can

mutate {
    rename => {
        "[playlists][tracks]" => "track"
        "[playlists][name]" => "[playlist][name]"
        "[playlists][pid]" => "[playlist][pid]"
    }
    remove_field => "playlists"
}

If you want the fields inside [track] to be at the top level, use a ruby filter like this:
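
Something along these lines should do it (untested, so treat it as a sketch):

ruby {
    # copy every key of [track] to the top level of the event, then drop [track]
    code => '
        t = event.get("track")
        if t.is_a?(Hash)
            t.each { |k, v| event.set(k, v) }
            event.remove("track")
        end
    '
}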

Hello @Badger and thanks for the very fast reply.

I added split as you mentioned, and it turns out this works with the sample I provided above, but not with my complete files (600,000 lines each, about 30 MiB on average).

I did add max_bytes and max_lines to the multiline codec, but it seems split doesn't want to work with files that big.

Have you encountered this before? Any suggestion?

Thank you

In old versions (before this PR was merged, which I think corresponds to filter version 3.1.8 and Logstash 7.5) the split filter would iterate over the array, cloning the event and overwriting the array with a single entry each time. For very large arrays, cloning the array and immediately deleting it on each iteration results in monstrous memory allocation rates and long GC delays.

Other than that, no.

Hello @Badger

Here is the logstash output when using a complete file:

exception=>#<LogStash::Json::ParserError: Unexpected end-of-input: expected close marker for Object (start marker at [Source: (byte[])"{
    "info": {
        "generated_on": "2017-12-03 08:41:42.057563", 
        "slice": "0-999", 
        "version": "v1"
    }, 
    "playlists": [
        {
            "name": "Throwbacks", 
            "collaborative": "false", 
            "pid": 0, 
            "modified_at": 1493424000, 
            "num_tracks": 52, 
            "num_albums": 47, 
            "num_followers": 1, 
            "tracks": [
                {
                    "pos": 0, 
                    "artist_name": ""[truncated 34118866 bytes]; line: 1, column: 1])
 at [Source: (byte[])"{
    "info": {
        "generated_on": "2017-12-03 08:41:42.057563", 
        "slice": "0-999", 
        "version": "v1"
    }, 
    "playlists": [
        {
            "name": "Throwbacks", 
            "collaborative": "false", 
            "pid": 0, 
            "modified_at": 1493424000, 
            "num_tracks": 52, 
            "num_albums": 47, 
            "num_followers": 1, 
            "tracks": [
                {
                    "pos": 0, 
                    "artist_name": ""[truncated 34118866 bytes]; line: 689058, column: 34119372]>}
[2021-12-08T06:01:56,759][WARN ][logstash.filters.split   ][main][2db25766a2e2a20dfe94279bf0ea70bddcaf131311c7db2fe09e225bf5560f53] Only String and Array types are splittable. field:playlists is of type = NilClass
[2021-12-08T06:01:56,760][WARN ][logstash.filters.split   ][main][3b9d9036dd11da5abb9b839b48e7d56d1b62c8c980f3cd831c83aac420f991d9] Only String and Array types are splittable. field:[playlists][tracks] is of type = NilClass
{
      "@version" => "1",
          "tags" => [
        [0] "multiline",
        [1] "_jsonparsefailure",
        [2] "_split_type_failure"
    ],
          "path" => "/home/user/Downloads/data/file.json",
          "host" => "user",
    "@timestamp" => 2021-12-08T11:01:51.655Z
}

My logstash.conf:

input{
 file{
    path => "/home/user/Downloads/data/file.json"
    sincedb_path => "/dev/null"
    start_position => "beginning"
    codec => multiline {
        pattern => "^Spalanzani"
        negate => true
        what => previous
        auto_flush_interval => 1
        max_lines => 10000000000000
        max_bytes => "5000 MiB"}

 }
}

filter {
    json { source => "message"}
    split { field => "playlists" }
    split { field => "[playlists][tracks]" }
    mutate {
        remove_field => [ "message" ]
    }

}

output{
    elasticsearch{
        hosts => "localhost:9200"
        index => "music"
    }
    stdout { }
}

This filter works with the example I showed in the first message of this topic. I googled the error and it suggests my data is not correct (missing brackets and such), but I double-checked and the data is correct.

Sorry for the double post, I think I found what was wrong.

I added a line break '\n' at the beginning of the JSON file and now it works with the same filter as in my last message. Any idea why? My data is correctly parsed now.

UPDATE:

Sorry for the third answer in a row.

It turns out it is not the line break that makes it work.

It is because I edit the file and then save it. Whenever Logstash processes a file I haven't touched, I get the splitting errors, but as soon as I add a line break and save, then delete that very same line break and save again, it works. That's super weird; it looks like I need to save each file at least once. I am trying to write a script for that right now because I have 1000 files. If you know why I need to do this, I would be very interested to know.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.