Ingest Attachment plugin not working with WPD files

I am trying to index files of various formats as documents in Elasticsearch 7.1, and extract their text using the Ingest Attachment plugin. I am using Amazon's managed Elasticsearch service.

This has been working great for many formats, but it does not seem to work for WPD (WordPerfect) files. When I try to index a WordPerfect file, the Ingest Attachment plugin does not add the attachment fields to my document, and it does not throw an error either.

Here is a link to the C# project where I set up the pipeline: https://github.com/JamesFaix/es-indexer/blob/master/es-indexer/EsService.cs#L72. The project won't run as-is because of security restrictions on the AWS instance it points to, but it should work against any blank ES 7 instance.

In JSON, the pipeline declaration looks like this:

{
  "description": "This is a pipeline!",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "properties": [
          "content"
        ]
      }
    },
    {
      "gsub": {
        "field": "attachment.content",
        "pattern": "\\s+",
        "replacement": " "
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "field": "error",
        "value": "Text extraction failed ({{ _ingest.on_failure_message }})"
      }
    },
    {
      "set": {
        "field": "attachment.content",
        "value": ""
      }
    }
  ]
}

What I am seeing is that when the attachment processor runs on a WPD file, no error is thrown and the attachment.content field is never set on the document. Because gsub runs right after it, gsub fails: the field it tries to read does not exist.
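
One possible way to keep gsub from failing when attachment.content is absent is the gsub processor's ignore_missing option. The following is only a minimal sketch in Python that builds such a pipeline body. The helper name build_pipeline is hypothetical, and it is an assumption that the option is accepted on the AWS-managed 7.1 service. It does not make WPD extraction work; it only avoids the follow-on gsub error.

```python
import json


def build_pipeline():
    """Return an ingest pipeline body as a plain dict (hypothetical helper)."""
    return {
        "description": "This is a pipeline!",
        "processors": [
            {"attachment": {"field": "data", "properties": ["content"]}},
            {
                "gsub": {
                    "field": "attachment.content",
                    "pattern": r"\s+",
                    "replacement": " ",
                    # Assumption: with ignore_missing set, gsub exits quietly
                    # when the attachment processor produced no content.
                    "ignore_missing": True,
                }
            },
        ],
        "on_failure": [
            {
                "set": {
                    "field": "error",
                    "value": "Text extraction failed ({{ _ingest.on_failure_message }})",
                }
            },
            {"set": {"field": "attachment.content", "value": ""}},
        ],
    }


if __name__ == "__main__":
    # json.dumps escapes the backslash, so the pattern is emitted as "\\s+".
    print(json.dumps(build_pipeline(), indent=2))
```

The printed JSON can be used as the body of the pipeline PUT request in place of the declaration above.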

If I use a DOCX, PDF, or several other formats, both processors work fine.

This could be an issue with the Attachment plugin or Tika, I'm not sure. The Tika documentation does say it supports WordPerfect and other Corel formats.

You could try the Tika command-line tool and see whether it works on your file.

See https://tika.apache.org/1.20/gettingstarted.html
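
As a rough sketch of that check, the snippet below wraps the Tika app's --text option in Python. The jar name tika-app-1.20.jar and the file name sample.wpd are placeholders, and both function names are hypothetical; it assumes java is on the PATH and the jar has been downloaded separately.

```python
import subprocess


def tika_text_command(jar_path, document_path):
    """Build the argument list that asks the Tika app for plain-text output."""
    return ["java", "-jar", jar_path, "--text", document_path]


def extract_text(jar_path, document_path):
    """Run Tika and return whatever text it extracted (may be empty)."""
    result = subprocess.run(
        tika_text_command(jar_path, document_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tika exits with an error
    )
    return result.stdout


if __name__ == "__main__":
    # Placeholder paths: point these at your own jar and WPD file.
    print(" ".join(tika_text_command("tika-app-1.20.jar", "sample.wpd")))
```

If extract_text returns text for the WPD file, standalone Tika can parse it, which would point at the plugin's bundled parsers rather than at the file itself.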

I tried creating a small test project in Java locally using Tika and it required an extra library org.apache.tika.parsers.wordPerfectParser. Perhaps this is missing in the Ingest Attachment plugin.

If that library is not listed in the dependencies at https://github.com/elastic/elasticsearch/blob/master/plugins/ingest-attachment/build.gradle#L25-L71, it is likely the culprit.
