Ingest Attachment plugin not working with WPD files

jamesfaix · September 23, 2019, 3:38pm

I am trying to index files of various formats as documents in Elasticsearch 7.1, and extract their text using the Ingest Attachment plugin. I am using Amazon's managed Elasticsearch service.

This has been working great for many formats, but it seems to not work for WPD (WordPerfect) files. When I try to insert a WordPerfect file, the Ingest Attachment plugin will not add the attachment fields to my document, but it will also not throw an error.

Here is a link to a C# project where I am setting up the pipeline. https://github.com/JamesFaix/es-indexer/blob/master/es-indexer/EsService.cs#L72 This project won't work locally because of security restrictions on the AWS instance it is pointing to, but it should work on any blank ES 7 instance.

In JSON, the pipeline declaration looks like this:

{
  "description": "This is a pipeline!",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "properties": [
          "content"
        ]
      }
    },
    {
      "gsub": {
        "field": "attachment.content",
        "pattern": "\\s+",
        "replacement": " "
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "field": "error",
        "value": "Text extraction failed ({{ _ingest.on_failure_message }})"
      }
    },
    {
      "set": {
        "field": "attachment.content",
        "value": ""
      }
    }
  ]
}

What I am seeing is that when the attachment processor runs on a WPD file, no error is thrown and the attachment.content field is never set on the document. Because the gsub processor runs next, it fails: the field it is trying to read does not exist.
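A possible stopgap while the root cause is open, sketched here in Python: guard the gsub processor so it only runs when the attachment processor actually produced attachment.content. This assumes the processor-level `if` condition (a Painless expression) is available on the cluster; `build_pipeline` is a name made up for this sketch, and the `on_failure` block is left out for brevity.

```python
import json


def build_pipeline():
    """Return the pipeline body as a dict; gsub is skipped when no text was extracted."""
    return {
        "description": "This is a pipeline!",
        "processors": [
            {"attachment": {"field": "data", "properties": ["content"]}},
            {
                "gsub": {
                    # Assumption: run only if the attachment processor set the field.
                    # "?." is Painless's null-safe access, so a missing
                    # "attachment" object does not raise.
                    "if": "ctx.attachment?.content != null",
                    "field": "attachment.content",
                    # Python "\\s+" is the regex \s+; json.dumps escapes it for JSON.
                    "pattern": "\\s+",
                    "replacement": " ",
                }
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_pipeline(), indent=2))
```

With a guard like this, a WPD document would skip gsub instead of tripping on_failure. It does not make the text extraction itself work.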

If I use a DOCX, PDF, or several other formats, both plugins work fine.
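One way to compare formats side by side without indexing anything is the simulate pipeline API (POST /_ingest/pipeline/_simulate), which returns the document as it looks after the processors have run. Below is a minimal Python sketch under some assumptions: the cluster accepts unauthenticated HTTP requests (a managed AWS domain may require signed requests instead), and the helper names, file name, and URL are placeholders made up for illustration.

```python
import base64
import json
import urllib.request

# Only the attachment processor, so gsub cannot mask what it returns.
ATTACHMENT_ONLY = {
    "processors": [{"attachment": {"field": "data", "properties": ["content"]}}]
}


def simulate_body(pipeline, file_bytes):
    """Build a request body for the simulate API from raw file bytes."""
    # The attachment processor expects the source field to be base64-encoded.
    encoded = base64.b64encode(file_bytes).decode("ascii")
    return {"pipeline": pipeline, "docs": [{"_source": {"data": encoded}}]}


def simulate(es_url, body):
    """POST the body to the cluster and return the parsed JSON response."""
    req = urllib.request.Request(
        es_url.rstrip("/") + "/_ingest/pipeline/_simulate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example usage (hypothetical file name and URL):
#   with open("sample.wpd", "rb") as f:
#       body = simulate_body(ATTACHMENT_ONLY, f.read())
#   print(json.dumps(simulate("http://localhost:9200", body), indent=2))
```

Running the same call against a DOCX and a WPD file should show whether the attachment object is present in one response and absent in the other.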

This could be an issue with the Attachment plugin or Tika, I'm not sure. The Tika documentation does say it supports WordPerfect and other Corel formats.

spinscale · September 24, 2019, 8:52am

You could try the Tika command-line tool and see if that works on your file.

See https://tika.apache.org/1.20/gettingstarted.html
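As a rough sketch of that check, the Python wrapper below shells out to the standalone tika-app jar with either `--text` (extract plain text) or `--detect` (report the detected type). It assumes Java is on the PATH and that the jar has been downloaded locally; the jar path, file name, and function names are placeholders, not anything from the plugin.

```python
import subprocess


def tika_command(jar_path, file_path, mode="--text"):
    """Build the argument list for tika-app; mode is --text or --detect."""
    if mode not in ("--text", "--detect"):
        raise ValueError("unsupported mode: " + mode)
    return ["java", "-jar", jar_path, mode, file_path]


def run_tika(jar_path, file_path, mode="--text"):
    """Run tika-app and return whatever it prints to stdout."""
    result = subprocess.run(
        tika_command(jar_path, file_path, mode),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode output as str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout


# Example usage (hypothetical paths):
#   print(run_tika("tika-app-1.20.jar", "sample.wpd", "--detect"))
#   print(run_tika("tika-app-1.20.jar", "sample.wpd", "--text"))
```

If standalone Tika extracts text from the file while the pipeline does not, that would point at how the plugin packages Tika rather than at the file itself.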

jamesfaix · September 25, 2019, 2:23pm

I tried creating a small test project in Java locally using Tika, and it required an extra library, org.apache.tika.parsers.wordPerfectParser. Perhaps this is missing from the Ingest Attachment plugin.

spinscale · September 25, 2019, 2:33pm

If that library is not listed in the dependencies at https://github.com/elastic/elasticsearch/blob/master/plugins/ingest-attachment/build.gradle#L25-L71, it's likely that this is the culprit.

system · October 23, 2019, 2:33pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
