Troubles with different file types using ingest attachment processor plugin

rshake · January 25, 2017, 9:27pm

I'm having multiple problems with the ingest attachment plugin.

I'm on ES version 5.1.1
The ingest attachment plugin has been installed
I've created my pipeline processor(s)
I have successfully ingested some simple text type documents

The trouble comes when I attempt more complex document types.
I've tried several Office documents:
pptx, ppt, docx, etc.
With these, the index request succeeds and the attachments appear to be indexed, but they never have any content:

"attachment": {
    "content_type": "application/zip",
    "content_length": 0
}

It is also interesting that most of them show a content_type of "application/zip" rather than an MS Office content type.

After seeing this I moved on to PDF documents and ran into indexing failures.
I tried several different PDFs, each from a different source (created differently). Basically, each one produced a different parse error from Tika.

The simplest PDF (plain text converted from a Word document) results in the following error:

{
  "error": {
    "root_cause": [
      {
        "type": "exception",
        "reason": "java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [_ingest._value.fileData]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5]; nested: IOException[java.util.zip.DataFormatException: invalid block type]; nested: DataFormatException[invalid block type];",
        "header": {
          "processor_type": "foreach"
        }
      }
    ],
    "type": "exception",
    "reason": "java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [_ingest._value.fileData]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5]; nested: IOException[java.util.zip.DataFormatException: invalid block type]; nested: DataFormatException[invalid block type];",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "ElasticsearchParseException[Error parsing document in field [_ingest._value.fileData]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5]; nested: IOException[java.util.zip.DataFormatException: invalid block type]; nested: DataFormatException[invalid block type];",
      "caused_by": {
        "type": "parse_exception",
        "reason": "Error parsing document in field [_ingest._value.fileData]",
        "caused_by": {
          "type": "tika_exception",
          "reason": "TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5",
          "caused_by": {
            "type": "i_o_exception",
            "reason": "java.util.zip.DataFormatException: invalid block type",
            "caused_by": {
              "type": "data_format_exception",
              "reason": "invalid block type"
            }
          }
        }
      }
    },
    "header": {
      "processor_type": "foreach"
    }
  },
  "status": 500
}

A larger, more complex PDF gives me this error:

{
  "error": {
    "root_cause": [
      {
        "type": "exception",
        "reason": "java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [_ingest._value.fileData]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5]; nested: IOException[Missing root object specification in trailer.];",
        "header": {
          "processor_type": "foreach"
        }
      }
    ],
    "type": "exception",
    "reason": "java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [_ingest._value.fileData]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5]; nested: IOException[Missing root object specification in trailer.];",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "ElasticsearchParseException[Error parsing document in field [_ingest._value.fileData]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5]; nested: IOException[Missing root object specification in trailer.];",
      "caused_by": {
        "type": "parse_exception",
        "reason": "Error parsing document in field [_ingest._value.fileData]",
        "caused_by": {
          "type": "tika_exception",
          "reason": "TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5",
          "caused_by": {
            "type": "i_o_exception",
            "reason": "Missing root object specification in trailer."
          }
        }
      }
    },
    "header": {
      "processor_type": "foreach"
    }
  },
  "status": 500
}

Here is my attachment pipeline:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information from arrays",
  "processors" : [
    {
      "foreach": {
        "field": "files",
        "processor": {
          "attachment": {
            "target_field": "_ingest._value.attachment",
            "field": "_ingest._value.fileData",
            "ignore_failure" : true
          }
        }
      }
    },
    {
      "foreach": {
        "field": "files",
        "processor": {
          "remove": {
            "field": "_ingest._value.fileData"
          }
        }
      }
    }
  ]
}

Am I running up against a Tika configuration issue, or are there other configurations in ES that I need to look at?

dadoonet · January 26, 2017, 4:33am

Can you share some of your documents?

rshake · January 26, 2017, 3:12pm

Yes, I've tried all of the files found here:

Attempted Files (google drive)

All of the PDFs cause errors. All of the others appear to index fine, but there is never any content.

dadoonet · January 26, 2017, 5:49pm

Thanks for sharing your files. It's really useful!

I tried your files with FSCrawler to see how it goes:

gh_playset_summary.pdf contains no text. Only images. As a result, I'm getting:

"\n\n"

Other documents are working well.

I'm going to try the same docs with ingest, hopefully tomorrow, to see how it goes. I'll update the thread.

rshake · January 26, 2017, 5:58pm

Thank you for your time, David.

When I attempt that same file (gh_playset_summary.pdf) I get the following error:

{
  "error": {
    "root_cause": [
      {
        "type": "exception",
        "reason": "java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [_ingest._value.fileData]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5]; nested: IOException[java.util.zip.DataFormatException: invalid block type]; nested: DataFormatException[invalid block type];",
        "header": {
          "processor_type": "foreach"
        }
      }
    ],
    "type": "exception",
    "reason": "java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [_ingest._value.fileData]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5]; nested: IOException[java.util.zip.DataFormatException: invalid block type]; nested: DataFormatException[invalid block type];",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "ElasticsearchParseException[Error parsing document in field [_ingest._value.fileData]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5]; nested: IOException[java.util.zip.DataFormatException: invalid block type]; nested: DataFormatException[invalid block type];",
      "caused_by": {
        "type": "parse_exception",
        "reason": "Error parsing document in field [_ingest._value.fileData]",
        "caused_by": {
          "type": "tika_exception",
          "reason": "TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5",
          "caused_by": {
            "type": "i_o_exception",
            "reason": "java.util.zip.DataFormatException: invalid block type",
            "caused_by": {
              "type": "data_format_exception",
              "reason": "invalid block type"
            }
          }
        }
      }
    },
    "header": {
      "processor_type": "foreach"
    }
  },
  "status": 500
}

dadoonet · January 26, 2017, 6:07pm

Yeah. I saw your initial post. I just need some time to test it with ingest-attachment now. It's on my list.

Tomorrow hopefully.

rshake · January 26, 2017, 9:31pm

Hi David,

I was able to resolve my problem. It turns out I hadn't read the ingest attachment plugin documentation carefully enough.

The source field must be a base64 encoded binary.

I was reading the files in directly, not as binary, and then converting them to base64.

It is interesting that simple text files still worked that way, and others half worked, and others failed.
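
A likely explanation, offered as an assumption since the thread does not show the client code: reading a file as text decodes its bytes with a character encoding, and any byte sequence that is not valid in that encoding gets replaced or altered. Plain ASCII text survives this unchanged. Compressed data inside Office files and PDFs does not, even though a leading signature such as the ZIP magic bytes may come through intact. The Python sketch below simulates the effect; the sample bytes and the function name `lossy_text_round_trip` are made up for illustration.

```python
def lossy_text_round_trip(raw: bytes) -> bytes:
    """Simulate reading bytes as UTF-8 text and writing them back out.

    Bytes that are not valid UTF-8 are replaced with U+FFFD, which is
    one way a text-mode read can silently alter binary data.
    """
    return raw.decode("utf-8", errors="replace").encode("utf-8")


# Plain ASCII text: every byte is valid UTF-8.
plain_text = b"Hello, attachment processor!\n"

# Hypothetical stand-in for a binary file: the ZIP signature followed by
# bytes that do not form valid UTF-8 (as compressed data typically would not).
binary_like = b"PK\x03\x04" + bytes([0x9C, 0xED, 0xFF, 0x80, 0x00, 0xC3])

text_survives = lossy_text_round_trip(plain_text) == plain_text
binary_survives = lossy_text_round_trip(binary_like) == binary_like
magic_survives = lossy_text_round_trip(binary_like).startswith(b"PK\x03\x04")

print("plain text unchanged:", text_survives)
print("binary data unchanged:", binary_survives)
print("ZIP signature still present:", magic_survives)
```

If this is what happened, it would be consistent with the symptoms earlier in the thread: text files indexed normally, Office files were still detected as "application/zip" but yielded no content, and PDFs failed while their compressed streams were being parsed.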

After I read the files in as binary and then converted to base64 they all work perfectly now.
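
As a minimal sketch of that fix, assuming a Python client (the thread does not say which language was used): read each file as raw bytes, base64-encode those bytes, and place the result in the `fileData` field of each element of `files`, matching the pipeline above. The `fileName` field, the helper names, and the sample bytes are hypothetical.

```python
import base64
import json
import tempfile
from pathlib import Path


def encode_file(path: Path) -> str:
    """Read the file as raw bytes and return its base64 form as an ASCII string."""
    return base64.b64encode(path.read_bytes()).decode("ascii")


def build_document(paths):
    """Build a document body with one `files` entry per path.

    `fileData` is the field the attachment processor reads in the
    pipeline above; `fileName` is an illustrative extra field.
    """
    return {
        "files": [
            {"fileName": p.name, "fileData": encode_file(p)}
            for p in paths
        ]
    }


# Demo with a throwaway file containing bytes that are not valid text.
sample_bytes = b"%PDF-1.4\n\x9c\xed\xff\x00"
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(sample_bytes)
    doc = build_document([sample])

body = json.dumps(doc)  # JSON body for the index request
round_trip = base64.b64decode(doc["files"][0]["fileData"])
print("bytes preserved:", round_trip == sample_bytes)
```

The resulting JSON body would then be sent in an index request that names the pipeline, for example with `?pipeline=attachment` on the request URL.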

Again, thanks for your assistance.

dadoonet · January 26, 2017, 9:56pm

Awesome! Thanks for the follow up!

system · February 23, 2017, 9:57pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
