Hi everyone, I have the following problem:
We have a pipeline in place which consists of:
[PrestoDB Clusters] ==auditing==> [Kafka] <== [Logstash] ==> Elastic + S3
The auditing messages on Kafka are basically JSON messages composed of various fields, which may contain ANY character the user can type.
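For illustration, an audit event looks roughly like this (the values are made up, the field names are the real ones; note the free-form Query field, which can contain newlines, pipes, quotes and so on):

{
  "CreateDate": "2021-03-10 14:22:05",
  "orgId": "1234",
  "QueryID": "20210310_142205_00042_abcde",
  "Catalog": "hive",
  "User": "some.user",
  "Query": "SELECT a, b\nFROM t\nWHERE note = 'contains | pipes'",
  ...
}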
Ingestion into Elastic works almost without problems.
Now I want to write some of the fields (but potentially all of them) to S3 as text files, or eventually as Parquet files.
So I am using the S3 output plugin.
I had to configure it as shown below to make it somewhat work, but obviously I am facing many problems due to characters like newlines, delimiters, strange characters, etc. Also, this doesn't seem like a good approach, since I have 20-30 more fields to add:
s3 {
  region => "eu-west-1"
  bucket => "my-bucket"
  prefix => "audit/some/sub/folder"
  encoding => "none"
  rotation_strategy => "size_and_time"
  temporary_directory => "/tmp/logstash"
  upload_queue_size => 4
  upload_workers_count => 4
  size_file => 5242880
  time_file => 2
  codec => line {
    format => "%{[CreateDate]}|%{[orgId]}|%{[QueryID]}|%{[Catalog]}|%{[User]}|%{[Query]}|%{[QueryStartTime]}|%{[EventName]}|%{[QueryType]}|%{[QueryEndTime]}"
  }
}
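The only mitigation I can think of with this approach is sanitizing the problem characters with a mutate/gsub filter before the output. A rough sketch (only the Query field shown, and it still feels fragile with 20-30 more fields coming):

filter {
  mutate {
    # sketch: flatten newlines/carriage returns and drop the pipe delimiter
    # from the free-form Query field; the real config would need the same
    # treatment for every text field in the format string
    gsub => [
      "Query", "\r", " ",
      "Query", "\n", " ",
      "Query", "[|]", " "
    ]
  }
}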
I have also tried the json codec, which does the job pretty well, but I don't want to write the data in JSON format, since the files will be read on Presto/Spark clusters by data scientists and it is not convenient to parse JSON with those tools.
I have also tried the csv codec, but it doesn't work at all, and I couldn't understand why.
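What I was going for with the csv codec was roughly this (shortened field list; I'm not sure these options are even correct for encoding, so treat it as a sketch of the intent rather than my exact config):

codec => csv {
  # sketch: pipe-separated values with an explicit column list (shortened here)
  columns   => ["CreateDate", "orgId", "QueryID", "Catalog", "User", "Query"]
  separator => "|"
}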
Is there something I am missing?