I am trying to index files of various formats as documents in Elasticsearch 7.1, and extract their text using the Ingest Attachment plugin. I am using Amazon's managed Elasticsearch service.
This has been working great for many formats, but it does not seem to work for WPD (WordPerfect) files. When I try to index a WordPerfect file, the Ingest Attachment plugin does not add the attachment fields to my document, but it does not throw an error either.
Here is a link to a C# project where I am setting up the pipeline: https://github.com/JamesFaix/es-indexer/blob/master/es-indexer/EsService.cs#L72
This project won't work locally because of security restrictions on the AWS instance it is pointing to, but it should work on any blank ES 7 instance.
In JSON, the pipeline declaration looks like this:
{
  "description": "This is a pipeline!",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "properties": [
          "content"
        ]
      }
    },
    {
      "gsub": {
        "field": "attachment.content",
        "pattern": "\\s+",
        "replacement": " "
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "field": "error",
        "value": "Text extraction failed ({{ _ingest.on_failure_message }})"
      }
    },
    {
      "set": {
        "field": "attachment.content",
        "value": ""
      }
    }
  ]
}
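A minimal sketch for reproducing this without the C# project: the same pipeline can be wrapped in a request body for the simulate endpoint (POST /_ingest/pipeline/_simulate). The Python below only builds that body and sends nothing. The helper name build_simulate_body and the placeholder bytes are assumptions made for illustration; a real test would read an actual WPD file.

```python
import base64
import json

# The pipeline from the question, written as a Python dict.
PIPELINE = {
    "description": "This is a pipeline!",
    "processors": [
        {"attachment": {"field": "data", "properties": ["content"]}},
        {"gsub": {"field": "attachment.content", "pattern": r"\s+", "replacement": " "}},
    ],
    "on_failure": [
        {"set": {"field": "error",
                 "value": "Text extraction failed ({{ _ingest.on_failure_message }})"}},
        {"set": {"field": "attachment.content", "value": ""}},
    ],
}


def build_simulate_body(file_bytes: bytes) -> str:
    """Return a JSON body for POST /_ingest/pipeline/_simulate (hypothetical helper)."""
    # The attachment processor expects the file as base64 text in the source field.
    encoded = base64.b64encode(file_bytes).decode("ascii")
    doc = {"_source": {"data": encoded}}
    return json.dumps({"pipeline": PIPELINE, "docs": [doc]}, indent=2)


if __name__ == "__main__":
    # Placeholder input; substitute the bytes of a real WPD file to reproduce the issue.
    print(build_simulate_body(b"placeholder bytes, not a real WPD file"))
```

Sending the printed body to the simulate endpoint should return the document as it looks after the pipeline runs, which makes it easy to compare a WPD file against a DOCX or PDF.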
What I am seeing is that if I run just the attachment processor on a WPD file, no error is thrown and the attachment.content field is never set on the document. Since the gsub processor runs right after it, gsub fails because the field it is trying to read is not there.
If I use a DOCX, PDF, or several other formats, both processors work fine.
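The sequence described above can be illustrated with a toy Python model. This is not how Elasticsearch or Tika is implemented. It only mimics the observed behaviour: an attachment step that sometimes sets no content, a gsub step that fails on a missing field, and the pipeline-level on_failure handlers. All function names and the error text are made up for illustration.

```python
import re


def attachment_step(doc, extracted_text):
    """Toy stand-in for the attachment processor.

    Sets attachment.content only when some text was extracted,
    which models the silent no-op seen with WPD files.
    """
    if extracted_text is not None:
        doc.setdefault("attachment", {})["content"] = extracted_text
    return doc


def gsub_step(doc):
    """Toy stand-in for gsub on attachment.content; fails if the field is absent."""
    if "content" not in doc.get("attachment", {}):
        # Hypothetical message, not the real Elasticsearch error text.
        raise ValueError("field [attachment.content] is missing")
    content = doc["attachment"]["content"]
    doc["attachment"]["content"] = re.sub(r"\s+", " ", content)
    return doc


def run_pipeline(doc, extracted_text):
    """Run both steps; on failure, apply the two 'set' handlers from on_failure."""
    try:
        return gsub_step(attachment_step(doc, extracted_text))
    except ValueError as exc:
        doc["error"] = f"Text extraction failed ({exc})"
        doc.setdefault("attachment", {})["content"] = ""
        return doc
```

In this model, run_pipeline({"data": "..."}, "some   text") collapses the whitespace, while run_pipeline({"data": "..."}, None) ends up with an error field and an empty attachment.content, which matches what happens with WPD files.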
This could be an issue with the Attachment plugin or Tika, I'm not sure. The Tika documentation does say it supports WordPerfect and other Corel formats.