Issue with outputting binary files

mikehwang · March 18, 2020, 9:38pm

I'm trying to use Logstash to ingest emails, write the messages into Elasticsearch, and write the attachments to the filesystem (maybe S3 later). The problem is that the attachment files are not written correctly and cannot be opened.

I'm stuck on writing the attachments to the filesystem. Here is a snippet of my pipelines.yaml:

- pipeline.id: email-write-attachments
  config.string: |
    input {
      pipeline { address => "process_attachments" }
    }

    filter {
      # Drop events that don't have attachments
      if ![attachments] {
        drop {}
      }
      # Prune out all fields except attachments and message-id
      prune {
        whitelist_names => ["^attachments$", "message-id"]
      }
      # Split each event by attachments
      split {
        field => "attachments"
      }
      # Decode attachments. Note: this does not check the Content-Transfer-Encoding,
      # it just assumes the body is always base64
      # https://discuss.elastic.co/t/filter-decode-from-base64/89282
      ruby {
        init => "require 'base64'"
        code => "event.set('[attachments][body]', Base64.decode64(event.get('[attachments][body]')))"
      }
    }

    output {
      file {
        path => "/work/attachments.out"
      }
      file {
        path => "/work/attachments/%{[message-id]}/%{[attachments][filename]}"
        codec => plain { format => "%{[attachments][body]}" }
      }
    }

The gist of the pipeline above: take the email messages, pull out the attachments, split them into individual events, and write each event (one attachment) to its own file. For example, an email could have a PDF and an image attached. Those two attachments get split into two events, one for the PDF and one for the image. In the end I want a PDF file and an image file, each named with the filename from the email.

All the attachments arrive Base64-encoded, so I have a filter step to decode them. I think the problem lies with either the file output plugin or the plain codec plugin. The files being written look binary but are corrupt.

As a separate test, I wrote the Base64-encoded attachments to a file, then in a separate script read that file, Base64-decoded the contents, and wrote the result to another file. That works fine, which tells me the issue is in the output part of the pipeline.
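
A minimal sketch of that kind of standalone check, in Ruby. The helper name `decode_base64_file` and both paths are hypothetical, since the original script was not posted; the point is only that decoding and writing in binary mode outside Logstash preserves the bytes.

```ruby
require 'base64'

# Hypothetical helper (not from the original post): read a file containing
# Base64 text and write the decoded bytes to another file.
def decode_base64_file(src_path, dst_path)
  encoded = File.read(src_path)
  decoded = Base64.decode64(encoded)  # result is an ASCII-8BIT (binary) string
  File.binwrite(dst_path, decoded)    # binary mode, no encoding conversion
end
```

If a file produced this way opens correctly while the one written by the pipeline does not, the corruption is introduced after the decode step.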

Other things I tried:

Specifying the charset as BINARY had no impact.
Setting the file output to flush immediately (flush interval of 0) had no impact.
In my separate script, I tried opening the file in a+ mode, as the file output plugin does, and that still worked.

Any ideas where the problem is and how to fix it?

Thank you

mikehwang · March 23, 2020, 11:17pm

I have uncovered the root cause of my problem. The culprit is the logstash-codec-plain plugin which invokes event.sprintf. The event.sprintf function always produces a UTF-8 string even if the logstash-codec-plain configuration parameter charset is not UTF-8 and even if the event value is not UTF-8.

It turns out that Logstash 5.0 had a breaking change: the core Event object was moved from native Ruby to Java for performance reasons. This also changed how logstash-codec-plain behaves, because it originally relied on a pure Ruby implementation that does not force a different encoding on the value. I'm not sure whether this was intended.

I tried forcing the resulting encoded value (result of event.sprintf) to be ASCII-8BIT but that didn't help.
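
A rough illustration in plain Ruby of why relabelling the string afterwards cannot help. It is an assumption that the Java-backed event.sprintf behaves like `String#scrub` here; the thread only establishes that it always returns UTF-8. The helper `force_utf8_lossy` is hypothetical and merely simulates a lossy conversion.

```ruby
# Hypothetical stand-in for a step that insists on valid UTF-8:
# every invalid byte sequence is replaced with U+FFFD.
def force_utf8_lossy(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).scrub("\uFFFD")
end

raw = "\xFF\xD8\xFF\xE0".b            # sample binary bytes, not valid UTF-8
as_utf8 = force_utf8_lossy(raw)

# Relabelling as ASCII-8BIT changes only the encoding tag, not the bytes,
# so the replacement characters stay and the original bytes are gone.
relabelled = as_utf8.dup.force_encoding(Encoding::ASCII_8BIT)

puts raw.bytes.inspect
puts relabelled.bytes.inspect
```

Once the invalid bytes have been replaced, no later change of encoding can recover them, which is consistent with the ASCII-8BIT attempt having no effect.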

There's another plugin called java_plain, which is implemented in Java and differs in that it actually respects the charset configuration. I tried testing it but got fatal exceptions. I raised an issue in elastic/logstash.

My current solution is to adapt the logstash-codec-plain plugin and use the legacy event.sprintf implementation which I can share here later.
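
Until that adapted codec is posted, here is a sketch of a different workaround, not the one described above: write the decoded bytes straight to disk with Ruby's binary file API so they never pass through a codec. The helper `write_attachment` and its parameters are hypothetical; wiring it into a ruby filter is an untested assumption.

```ruby
require 'base64'
require 'fileutils'

# Hypothetical helper, not part of Logstash or any plugin.
# Decodes a Base64 body and writes it under base_dir/<message_id>/<filename>.
def write_attachment(base_dir, message_id, filename, base64_body)
  # File.basename strips directory parts; stricter validation of the
  # names is left out of this sketch.
  dir = File.join(base_dir, File.basename(message_id))
  FileUtils.mkdir_p(dir)
  path = File.join(dir, File.basename(filename))
  File.binwrite(path, Base64.decode64(base64_body))  # bytes written as-is
  path
end
```

The trade-off is that file handling moves out of the output stage, so features of the file output plugin no longer apply to these files.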

Further, it's my opinion that there is a bug here either in the new event.sprintf implementation or logstash-codec-plain (which should respect the charset but not sure if it can). Maybe the logstash-codec-plain plugin should just be deprecated?

