S3 Output Plugin: Correct Way to Manage the Codec

Hi everyone, I have the following problem:
We have a pipeline in place which consists of:

[PrestoDB Clusters]  ==auditing==> [Kafka] <== [Logstash] ==> Elastic + S3

The auditing messages on Kafka are basically JSON messages composed of various fields, which may contain ANY character a user can type.

The ingestion into Elastic works with almost no problems.
Now I want to write some of the fields (but potentially all of them) to S3 in a text file, or eventually a Parquet file.

So I am using the S3 output plugin. I had to configure it as follows to make it somehow work, but I am obviously facing many problems due to newlines, delimiters, strange characters, etc. in the field values. This also doesn't seem like a good approach, since I have 20-30 more fields to add.

s3 {
        region => "eu-west-1"
        bucket => "my-bucket"
        prefix => "audit/some/sub/folder"
        encoding => "none"
        rotation_strategy => "size_and_time"
        temporary_directory => "/tmp/logstash"
        upload_queue_size => 4
        upload_workers_count => 4
        size_file => 5242880
        time_file => 2
        codec => line {
          format => "%{[CreateDate]}|%{[orgId]}|%{[QueryID]}|%{[Catalog]}|%{[User]}|%{[Query]}|%{[QueryStartTime]}|%{[EventName]}|%{[QueryType]}|%{[QueryEndTime]}"
        }
}

I have also tried the json codec, which does the job pretty well, but I don't want to write the data in JSON format: the files will be read in Presto/Spark clusters by data scientists, and parsing JSON with those tools is not convenient.
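For reference, the JSON attempt was essentially just a codec swap inside the same s3 block. A minimal sketch (json_lines is the newline-delimited variant of the json codec, which is generally what you want when writing files):

```
s3 {
    # ... same region/bucket/rotation settings as above ...
    codec => json_lines {}
}
```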

I have tried the csv codec, but it doesn't work at all, and I couldn't understand why…

Is there something I am missing?

I managed to solve my problem.

I missed it in both the logs and the docs, but this codec plugin (csv) doesn't come preinstalled, so you have to install it first with:

bin/logstash-plugin install logstash-codec-csv
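With the plugin installed, the codec can replace the hand-built line format. A sketch of what the s3 block looks like with it (the columns list here is abbreviated, and the option names — columns, separator, include_headers — should be double-checked against the docs for your installed codec version):

```
s3 {
    region => "eu-west-1"
    bucket => "my-bucket"
    prefix => "audit/some/sub/folder"
    codec => csv {
      # only the listed fields are written, in this order
      columns => ["CreateDate", "orgId", "QueryID", "Catalog", "User", "Query"]
      separator => "|"
      include_headers => false
    }
}
```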

After that, I had to escape newlines in the fields to avoid unwanted line breaks:

mutate {
    gsub => [
      "[Query]", "[\n]", "\\\\n",
      "[PreparedQuery]", "[\n]", "\\\\n"
    ]
}

The plugin takes care of doubling any double quote in the data (that's how you escape double quotes in CSV).
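As a hypothetical example of what that looks like in the output, a field value containing double quotes gets wrapped in quotes and its inner quotes doubled, RFC 4180-style:

```
field value:  select "col" from t
CSV output:   "select ""col"" from t"
```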

For the separator I have opened another thread in the forum, see:
