Attachment Pipeline Support for Old MS Word and Excel Formats

Hi All! I'm uploading a Microsoft Word document in the old .doc format (not .docx) to Elasticsearch through an attachment pipeline, and I receive the following response:

{
    "error": {
        "root_cause": [
            {
                "type": "parse_exception",
                "reason": "Error parsing document in field [fileContent]"
            }
        ],
        "type": "parse_exception",
        "reason": "Error parsing document in field [fileContent]",
        "caused_by": {
            "type": "no_such_file_exception",
            "reason": "/tmp/elasticsearch-8583309320442462221/apache-tika-15922682441714930542.tmp"
        }
    },
    "status": 400
}
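The informative part of a response like this is the nested `caused_by` entry rather than the top-level reason. A minimal Python sketch for unwinding that chain, assuming the response body has already been parsed into a dict; `error_chain` is a hypothetical helper name, not part of any Elasticsearch client:

```python
def error_chain(response):
    """Return (type, reason) pairs from the top-level error down through caused_by."""
    chain = []
    node = response.get("error")
    # Follow the nested caused_by objects until there are none left.
    while isinstance(node, dict):
        chain.append((node.get("type", ""), node.get("reason", "")))
        node = node.get("caused_by")
    return chain
```

Applied to the response above, this would list the `parse_exception` first and the `no_such_file_exception` second, which makes it easier to see that the failure comes from a missing Tika temporary file rather than from the document content itself.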

Please advise whether the pipeline has any limitation with these formats, or whether any additional setup is required. Thanks in advance.

Welcome!

That's a weird error.
Is there a chance you could share your binary document?

Which Elasticsearch version are you using?

I'm running Elasticsearch 7.12.0, and I installed the ingest attachment plugin as described in "Ingest Attachment Processor Plugin" (Elasticsearch Plugins and Integrations [7.15]).
Please find my sample document and the request JSON here. Thanks.

So I tried your file with this pipeline:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "fileContent"
      }
    },{
      "remove" : {
        "field" : "fileContent"
      }
    }
  ]
}

And then:

POST /_ingest/pipeline/attachment/_simulate
{
  "docs": [
    {
      "_source": { "fileContent": "BASE64-CONTENT-HERE" }
    }
  ]
}
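For anyone reproducing this, the `BASE64-CONTENT-HERE` placeholder is the Base64 encoding of the raw file bytes. A minimal Python sketch for building the simulate request body, assuming the field is named `fileContent` as in the pipeline above; `build_simulate_body` is a hypothetical helper name:

```python
import base64
import json


def build_simulate_body(raw, field="fileContent"):
    """Return the _simulate JSON body with `raw` (bytes) Base64-encoded in `field`."""
    # The attachment processor expects the binary document as a Base64 string.
    encoded = base64.b64encode(raw).decode("ascii")
    return json.dumps({"docs": [{"_source": {field: encoded}}]})
```

To use it, read the document in binary mode (for example `open(path, "rb").read()`, where `path` points at your own .doc file) and send the returned string as the body of the `_simulate` request.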

This gave:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "attachment" : {
            "date" : "2021-11-25T05:54:00Z",
            "language" : "lt",
            "content_type" : "application/msword",
            "author" : "Chan, David",
            "content" : "abc",
            "content_length" : 4
          }
        },
        "_ingest" : {
          "timestamp" : "2021-11-30T16:11:45.118139599Z"
        }
      }
    }
  ]
}

I suspect a bug in the Tika version bundled with 7.12.0; Tika may have been upgraded since then.
I tested on 7.15.1.
