Hello,
I'm trying to parse a Kaggle movies dataset, which are csv files, but I'm having a lot of headaches with the fields that actually hold a json object...
I'm now almost done with it, but I had to do a lot of, perhaps unnecessary, maneuveurs to get this to work. Like this one:
mutate {
gsub => [
"production_companies", 'Orlenok",', 'Orlenok"',
"production_companies", "\\xa0", "",
"production_companies", ': "', ": '",
"production_companies", '",', "',",
"production_companies", '"', '\\"',
"production_companies", "((?<={)\s*\'|(?<=,)\s*\'|\'\s*(?=:)|(?<=:)\s*\'|\'\s*(?=,)|\'\s*(?=}))", '"'
]
}
json {
source => "production_companies"
target => "production_companies"
}
I can now import the movies and keywords datasets, but I am not getting any luck with the credits dataset:
cast,id
"[{'name': 'John Doe'}, {'name': 'Jane ""Doe""'}, {'name': ""Daisy 'Margaret' Drummond""}]",1
As you can see, it is a quotations nightmare. I'm pretty sure there's got to be a more elegant way to treat this pattern (a csv that holds json in its fields)...
So I ask: how would you generally solve this in a Logstash configuration?
Here is a partial example of my movies configuration file, only to show an example of the kind of problems I had to solve so far...
...but again, I am looking for the generally accepted solution to parse this standard (is it?) of json objects inside csv fields.
Thank you!
input {
stdin { }
file {
path => "${PWD}/data/movies_metadata.csv"
start_position => "beginning"
sincedb_path => "/dev/null"
codec => multiline {
pattern => "^(False|True),"
negate => true
what => "previous"
auto_flush_interval => 1
}
}
}
filter {
csv {
columns => [ "adult","belongs_to_collection","budget","genres","homepage","id","imdb_id","original_language","original_title","overview","popularity","poster_path","production_companies","production_countries","release_date","revenue","runtime","spoken_languages","status","tagline","title","video","vote_average","vote_count" ]
skip_header => true
}
mutate {
gsub => [
"production_companies", 'Orlenok",', 'Orlenok"',
"production_companies", "\\xa0", "",
"production_companies", ': "', ": '",
"production_companies", '",', "',",
"production_companies", '"', '\\"',
"production_companies", "((?<={)\s*\'|(?<=,)\s*\'|\'\s*(?=:)|(?<=:)\s*\'|\'\s*(?=,)|\'\s*(?=}))", '"'
]
}
json {
source => "production_companies"
target => "production_companies"
}
}
output {
stdout { codec => dots }
#stdout { codec => rubydebug }
elasticsearch {
hosts => ["localhost:9200"]
index => "movies"
}
}
Movies dataset: