Logstash CSV/JSON quotes nightmare

Hello,
I'm trying to parse a Kaggle movies dataset, which consists of CSV files, but I'm having a lot of headaches with the fields that actually hold a JSON object...
I'm now almost done with it, but I had to do a lot of perhaps unnecessary maneuvers to get this to work. Like this one:

mutate {
    gsub => [
        # one-off fix for a single problematic record ("Orlenok")
        "production_companies", 'Orlenok",', 'Orlenok"',
        # drop \xa0 (non-breaking space) junk
        "production_companies", "\\xa0", "",
        # temporarily demote double-quoted values to single quotes...
        "production_companies", ': "', ": '",
        "production_companies", '",', "',",
        # ...escape whatever double quotes remain...
        "production_companies", '"', '\\"',
        # ...then promote the single quotes that delimit keys/values to double quotes
        "production_companies", "((?<={)\s*\'|(?<=,)\s*\'|\'\s*(?=:)|(?<=:)\s*\'|\'\s*(?=,)|\'\s*(?=}))", '"'
    ]
}
json {
    source => "production_companies"
    target => "production_companies"
}

I can now import the movies and keywords datasets, but I am not having any luck with the credits dataset:

cast,id
"[{'name': 'John Doe'}, {'name': 'Jane ""Doe""'}, {'name': ""Daisy 'Margaret' Drummond""}]",1

As you can see, it is a quotation nightmare. I'm pretty sure there has to be a more elegant way to treat this pattern (a CSV that holds JSON in its fields)...
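
Just to be explicit about the goal: once everything is cleaned up, that cast field should end up as plain JSON, something like

[{"name": "John Doe"}, {"name": "Jane \"Doe\""}, {"name": "Daisy 'Margaret' Drummond"}]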

So I ask: how would you generally solve this in a Logstash configuration?

Here is a partial example of my movies configuration file, just to show the kind of problems I have had to solve so far...
...but again, I am looking for the generally accepted way to parse this pattern (is it a standard?) of JSON objects inside CSV fields.

Thank you!

input {
    stdin { }

    file {
        path => "${PWD}/data/movies_metadata.csv"
        start_position => "beginning"
        sincedb_path => "/dev/null"
        # fields can contain embedded newlines; every real record starts with
        # the "adult" column (False/True), so join any line that does not
        # start that way onto the previous event
        codec => multiline {
            pattern => "^(False|True),"
            negate => true
            what => "previous"
            auto_flush_interval => 1
        }
    }
}
filter {
    csv {
        columns => [ "adult","belongs_to_collection","budget","genres","homepage","id","imdb_id","original_language","original_title","overview","popularity","poster_path","production_companies","production_countries","release_date","revenue","runtime","spoken_languages","status","tagline","title","video","vote_average","vote_count" ]
        skip_header => true
    }

    mutate {
        gsub => [
            "production_companies", 'Orlenok",', 'Orlenok"',
            "production_companies", "\\xa0", "",
            "production_companies", ': "', ": '",
            "production_companies", '",', "',",
            "production_companies", '"', '\\"',
            "production_companies", "((?<={)\s*\'|(?<=,)\s*\'|\'\s*(?=:)|(?<=:)\s*\'|\'\s*(?=,)|\'\s*(?=}))", '"'
        ]
    }
    json {
        source => "production_companies"
        target => "production_companies"
    }
}
output {
    stdout { codec => dots }

    #stdout { codec => rubydebug }

    elasticsearch {
        hosts => ["localhost:9200"]
        index => "movies"
    }
}


@ffknob For the credits dataset I would try to match it as [...],[...],n (where n is the numeric id) and not treat it as CSV. That allows you to get the cast and crew into their own objects. This works:

grok {
    match => { "message" => '^(\[\]|"(?<[@metadata][crew]>\[.*\])"),(\[\]|"(?<[@metadata][cast]>\[.*\])"),%{INT:id}' }
    remove_field => "message"
}
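
Capturing into [@metadata] keeps the raw strings out of the indexed document; once they have been massaged into valid JSON, the idea would be to hand them to json filters, roughly like this (the cast/crew target names are just my choice):

json {
    source => "[@metadata][cast]"
    target => "cast"
}
json {
    source => "[@metadata][crew]"
    target => "crew"
}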

However, this just moves you on to a different nightmare. The JSON arrays use single quotes, like

[ { 'foo': 'a' } ]

but a json filter requires double quotes. You need to reverse the usage of single and double quotes, but that still leaves you with issues like

'name': ""Kelly O'Connell""

where doubled double quotes are used to signify that the string may contain single quotes.
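
The reversal itself can be sketched with a chain of gsubs along these lines (deliberately naive, assumes the [@metadata][cast] field from the grok above, and will still trip over embedded apostrophes and those doubled double quotes):

mutate {
    gsub => [
        # stash the real double quotes so the swap does not clobber them
        "[@metadata][cast]", '"', "@@DQ@@",
        # promote the single quotes that delimit keys/values to double quotes
        "[@metadata][cast]", "'", '"',
        # restore the stashed double quotes as single quotes
        "[@metadata][cast]", "@@DQ@@", "'"
    ]
}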

This appears to be pretty common, but is easy to fix:

'profile_path': None
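
A gsub along these lines would probably take care of it (again assuming the [@metadata][cast] field; the same would apply to crew), since JSON has null rather than None:

mutate {
    gsub => [ "[@metadata][cast]", ": None", ": null" ]
}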

It just feels like there is going to be a huge amount of ad-hocery involved.
