Synchronize whole files (mostly PDFs) to ES from directory and e-mails

Hi all,

we would like to set up document synchronization from a file directory and from an e-mail IMAP folder to Elasticsearch.

For the e-mails we are interested in the attachments. Here the imap input plugin looks promising.

For the files (e.g. PDFs), we would like to set up a directory that gets scanned, and whenever a new file is added, the whole file content (e.g. as a byte array) gets transferred.
Here the file input plugin looks like a candidate, but maybe not. From the documentation it is not clear to me whether we could configure it for the purpose I described.

Any opinions/suggestions? Maybe there are other plugins or totally different solutions out there that we are not aware of.

Thanks in advance,
Johannes

Does nobody have any experience with this topic, or at least any suggestions/opinions?

Actually, the IMAP input worked quite well:

input {
  imap {
    ...
  }
}

filter {
  #duplicate event for each attachment
  split {
    field => "attachments"
  }
  if ".pdf" in [attachments][filename] {
    ...
  } else {
    #drop non-pdf attachments
    drop { }
  }
}

output {
  ...
}

The file input plugin will most emphatically not be your friend, I think. Logstash is really mostly about... well, logs, and is not really kitted out for use as a generalised Elasticsearch ETL tool.

I would suggest something a bit more custom perhaps. Considering you're doing things with PDFs, you might want to check out more about Ingest Pipelines and what those might offer (sorry, I'm not familiar with non-Logs use-cases with Ingest Pipelines).
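To make the "a bit more custom" idea concrete, here is a minimal Python sketch, not a tested solution. It scans a directory for PDFs that have not been seen yet, base64-encodes each whole file, and indexes it through an ingest pipeline that uses the `attachment` processor. The Elasticsearch URL, the index name `documents`, the pipeline name `pdf_attachment`, and all function names are assumptions made up for illustration. Depending on your Elasticsearch version, the `attachment` processor may need the ingest-attachment plugin installed.

```python
import base64
import json
from pathlib import Path
from urllib import request

# All three values are assumptions for illustration; adjust to your cluster.
ES_URL = "http://localhost:9200"
INDEX = "documents"
PIPELINE = "pdf_attachment"


def pipeline_definition():
    """Ingest pipeline body: extract text from the base64 field 'data'."""
    return {
        "description": "Extract text from base64-encoded files",
        "processors": [{"attachment": {"field": "data"}}],
    }


def find_new_files(directory, seen, suffix=".pdf"):
    """Return files in `directory` with the given suffix whose names are not in `seen`."""
    return [
        path
        for path in sorted(Path(directory).iterdir())
        if path.is_file()
        and path.suffix.lower() == suffix
        and path.name not in seen
    ]


def build_document(path):
    """Read the whole file and wrap it as a JSON-serializable document."""
    raw = Path(path).read_bytes()
    return {
        "filename": Path(path).name,
        "data": base64.b64encode(raw).decode("ascii"),
    }


def send_json(url, body, method):
    """Send a JSON body to Elasticsearch and return the parsed response."""
    req = request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


def create_pipeline():
    """Create (or overwrite) the ingest pipeline once, before syncing."""
    return send_json(f"{ES_URL}/_ingest/pipeline/{PIPELINE}", pipeline_definition(), "PUT")


def sync_directory(directory, seen):
    """Index every not-yet-seen PDF through the pipeline and remember its name."""
    for path in find_new_files(directory, seen):
        send_json(f"{ES_URL}/{INDEX}/_doc?pipeline={PIPELINE}", build_document(path), "POST")
        seen.add(path.name)
```

You would call `create_pipeline()` once and then run `sync_directory("/some/dir", seen)` periodically, e.g. from cron or a simple loop. Note that `seen` only lives in memory here, so a real setup would need to persist it, or derive document IDs from the file name or a content hash, to avoid re-indexing after a restart.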

No experience with the imap plugin, but it's interesting to see that it can do that.
