Issue with outputting binary files

I'm trying to use Logstash to ingest emails, write the messages into Elasticsearch, and write the attachments to the filesystem (maybe S3 later). My issue is that the attachment files are not written correctly, so they are not readable.

I'm stuck on writing the attachments to the filesystem. Here is a snippet of my pipelines.yaml:

- pipeline.id: email-write-attachments
  config.string: |
    input {
      pipeline { address => "process_attachments" }
    }

    filter {
      # Drop events that don't have attachments
      if ![attachments] {
        drop {}
      }
      # Prune out all fields except attachments and message-id
      prune {
        whitelist_names => ["^attachments$", "message-id"]
      }
      # Split each event by attachments
      split {
        field => "attachments"
      }
      # Decode attachments - Note not checking the "content-transfer-type", just assuming
      # it's always base64
      # https://discuss.elastic.co/t/filter-decode-from-base64/89282
      ruby {
        init => "require 'base64'"
        code => "event.set('[attachments][body]', Base64.decode64(event.get('[attachments][body]')))"
      }
    }

    output {
      file {
        path => "/work/attachments.out"
      }
      file {
        path => "/work/attachments/%{[message-id]}/%{[attachments][filename]}"
        codec => plain { format => "%{[attachments][body]}" }
      }
    }

The gist of the above: take the email messages, split the attachments into individual events, and write each event (attachment) to its own file. For example, an email could have a PDF and an image attached. Those two attachments get split into two events, one for the PDF and one for the image. In the end I want a PDF file and an image file, each named with the filename from the email.

All the attachments come in Base64-encoded, so I have a filter step to decode them. I think my issue lies with either the file output plugin or the plain codec plugin. The files being written appear to be binary but are corrupt.

As a separate test, I wrote the Base64-encoded attachments to a file, and then in a separate script simply read the file, Base64-decoded it, and wrote the result to another file. That works fine, which tells me the issue is in the output part of the pipeline.
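For reference, a standalone check like that one can be sketched roughly as follows. This is a minimal sketch, not the original script: the method name `decode_base64_file` and both paths are hypothetical.

```ruby
require 'base64'

# Hypothetical helper: read a Base64 text file, decode it, and write the raw bytes.
def decode_base64_file(src_path, dest_path)
  encoded = File.read(src_path)
  decoded = Base64.decode64(encoded)   # returns a binary (ASCII-8BIT) string
  File.binwrite(dest_path, decoded)    # binary mode: no encoding or newline conversion
end
```

Because the decoded bytes go straight from `Base64.decode64` to `File.binwrite`, nothing in between gets a chance to reinterpret them as text.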

Other things I tried:

  • Specifying the charset as BINARY had no impact
  • Setting the file output's flush interval to 0 had no impact
  • In my separate script, I tried opening the file in a+ mode, as the file output plugin does, and that still worked

Any ideas where the problem is and how to fix it?

Thank you

I have uncovered the root cause of my problem. The culprit is the logstash-codec-plain plugin which invokes event.sprintf. The event.sprintf function always produces a UTF-8 string even if the logstash-codec-plain configuration parameter charset is not UTF-8 and even if the event value is not UTF-8.
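To illustrate why a forced UTF-8 string is fatal for binary data, here is a small plain-Ruby sketch. It does not reproduce Logstash's actual code path; the helper `lossy_utf8` is hypothetical and only assumes that bytes which are not valid UTF-8 get replaced during the conversion.

```ruby
# Hypothetical helper: treat arbitrary bytes as UTF-8 and replace anything invalid
# with U+FFFD. This stands in for a lossy bytes-to-UTF-8 conversion (an assumption).
def lossy_utf8(binary)
  binary.dup.force_encoding('UTF-8').scrub("\uFFFD")
end

# The first bytes of a JPEG file; 0xFF is never valid in UTF-8.
raw = [0xFF, 0xD8, 0xFF, 0xE0].pack('C*')
converted = lossy_utf8(raw)

puts converted.valid_encoding?        # the result is well-formed UTF-8 ...
puts converted.bytes == raw.bytes     # ... but the bytes are no longer the original ones
```

Once the invalid bytes have been replaced, the original values are gone, which would match the "binary but corrupt" files I was seeing.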

It turns out that Logstash 5.0 introduced a breaking change: the core Event object was moved from native Ruby to Java for performance reasons. This also changed how logstash-codec-plain behaves, because it originally relied on a pure Ruby implementation that doesn't force a different encoding on the value. I'm not sure if this was intended.

I tried forcing the resulting encoded value (result of event.sprintf) to be ASCII-8BIT but that didn't help.
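The likely reason, sketched in plain Ruby: `force_encoding` only changes the encoding label on a string and never touches its bytes, so it cannot undo a conversion that has already happened. The helper name `relabel_as_binary` is hypothetical, and the `scrub` call is only a stand-in for whatever lossy conversion happened upstream.

```ruby
# Hypothetical helper: relabel a string as binary without changing its bytes.
def relabel_as_binary(str)
  str.dup.force_encoding(Encoding::BINARY)
end

original  = [0xFF, 0xD8].pack('C*')
# Stand-in (assumption) for a value that already went through a lossy UTF-8 conversion.
damaged   = original.dup.force_encoding('UTF-8').scrub
relabeled = relabel_as_binary(damaged)

puts relabeled.bytes == damaged.bytes    # same bytes as the damaged value
puts relabeled.bytes == original.bytes   # still not the original bytes
```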

There's another plugin called java_plain, implemented in Java, which behaves differently because it actually respects the charset configuration. I tried testing it but got fatal exceptions. I raised an issue in elastic/logstash.

My current solution is to adapt the logstash-codec-plain plugin to use the legacy event.sprintf implementation, which I can share here later.
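A different workaround, which is not what I am running, would be to skip the codec entirely and write the bytes from Ruby. The sketch below is untested inside Logstash; `write_attachment` and its parameters are hypothetical names, and it takes the still-encoded Base64 body so that no binary string ever passes through event.sprintf.

```ruby
require 'base64'
require 'fileutils'

# Hypothetical helper, not part of Logstash or of the adapted codec.
# Decodes a Base64 body and writes it to base_dir/<message id>/<filename>.
def write_attachment(base_dir, message_id, filename, encoded_body)
  # Keep only the final path component so a filename cannot point outside base_dir.
  safe_name = File.basename(filename)
  # Replace characters that are awkward in directory names (e.g. the <> around a Message-ID).
  safe_dir = File.join(base_dir, message_id.gsub(/[^\w@-]/, '_'))
  FileUtils.mkdir_p(safe_dir)
  path = File.join(safe_dir, safe_name)
  File.binwrite(path, Base64.decode64(encoded_body))
  path
end
```

In a ruby filter this could presumably be called with `event.get('[message-id]')`, `event.get('[attachments][filename]')` and `event.get('[attachments][body]')`, in place of the decode step and the second file output.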

Further, in my opinion there is a bug here, either in the new event.sprintf implementation or in logstash-codec-plain (which should respect the charset, though I'm not sure it can). Maybe the logstash-codec-plain plugin should just be deprecated?
