Troubles with different file types using ingest attachment processor plugin


(Russ) #1

I'm having multiple problems with the ingest attachment plugin.

  • I'm on ES version 5.1.1
  • The ingest attachment plugin has been installed
  • I've created my pipeline processor(s)
  • I have successfully ingested some simple text type documents

The trouble starts with more complex document types.
I've tried several Office formats: pptx, ppt, docx, etc.
With these, the index request succeeds and the attachments appear to be indexed, but the extracted content is always empty:

"attachment": {
    "content_type": "application/zip",
    "content_length": 0
}

Interestingly, most of them report a content_type of "application/zip" rather than an MS Office content type.

After seeing this I moved on to PDF documents and ran into indexing failures.
I tried several PDFs, each from a different source and created differently. Each one produced a different parse error from Tika.

The simplest PDF, plain text converted from a Word document, results in the following error:

{
  "error": {
    "root_cause": [
      {
        "type": "exception",
        "reason": "java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [_ingest._value.fileData]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5]; nested: IOException[java.util.zip.DataFormatException: invalid block type]; nested: DataFormatException[invalid block type];",
        "header": {
          "processor_type": "foreach"
        }
      }
    ],
    "type": "exception",
    "reason": "java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [_ingest._value.fileData]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5]; nested: IOException[java.util.zip.DataFormatException: invalid block type]; nested: DataFormatException[invalid block type];",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "ElasticsearchParseException[Error parsing document in field [_ingest._value.fileData]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5]; nested: IOException[java.util.zip.DataFormatException: invalid block type]; nested: DataFormatException[invalid block type];",
      "caused_by": {
        "type": "parse_exception",
        "reason": "Error parsing document in field [_ingest._value.fileData]",
        "caused_by": {
          "type": "tika_exception",
          "reason": "TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5",
          "caused_by": {
            "type": "i_o_exception",
            "reason": "java.util.zip.DataFormatException: invalid block type",
            "caused_by": {
              "type": "data_format_exception",
              "reason": "invalid block type"
            }
          }
        }
      }
    },
    "header": {
      "processor_type": "foreach"
    }
  },
  "status": 500
}

A larger, more complex PDF gives me this error:

{
  "error": {
    "root_cause": [
      {
        "type": "exception",
        "reason": "java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [_ingest._value.fileData]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5]; nested: IOException[Missing root object specification in trailer.];",
        "header": {
          "processor_type": "foreach"
        }
      }
    ],
    "type": "exception",
    "reason": "java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [_ingest._value.fileData]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5]; nested: IOException[Missing root object specification in trailer.];",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "ElasticsearchParseException[Error parsing document in field [_ingest._value.fileData]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5]; nested: IOException[Missing root object specification in trailer.];",
      "caused_by": {
        "type": "parse_exception",
        "reason": "Error parsing document in field [_ingest._value.fileData]",
        "caused_by": {
          "type": "tika_exception",
          "reason": "TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5",
          "caused_by": {
            "type": "i_o_exception",
            "reason": "Missing root object specification in trailer."
          }
        }
      }
    },
    "header": {
      "processor_type": "foreach"
    }
  },
  "status": 500
}

Here is my attachment pipeline:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information from arrays",
  "processors" : [
    {
      "foreach": {
        "field": "files",
        "processor": {
          "attachment": {
            "target_field": "_ingest._value.attachment",
            "field": "_ingest._value.fileData",
            "ignore_failure" : true
          }
        }
      }
    },
    {
      "foreach": {
        "field": "files",
        "processor": {
          "remove": {
            "field": "_ingest._value.fileData"
          }
        }
      }
    }
  ]
}

Am I running up against a Tika configuration issue, or are there other configurations in ES that I need to look at?
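
For reference, a document that this pipeline can process needs a `files` array whose elements each carry a `fileData` string. The Python sketch below only builds such a request body. The `fileName` key and the function name are illustrative assumptions that do not come from the pipeline above, and each value is assumed to be base64-encoded already.

```python
import json


def build_document(encoded_files):
    """Build a JSON body matching the pipeline's `files[].fileData` layout.

    encoded_files maps a file name to a string that is ALREADY base64-encoded.
    `fileName` is a hypothetical extra key, not required by the pipeline.
    """
    return json.dumps({
        "files": [
            {"fileName": name, "fileData": data}
            for name, data in encoded_files.items()
        ]
    })


if __name__ == "__main__":
    # "JVBERi0=" is a placeholder payload, not a real document.
    print(build_document({"example.pdf": "JVBERi0="}))
```

The resulting body would be sent in an index request that names the pipeline, for example with the `pipeline=attachment` query parameter.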


(David Pilato) #2

Can you share some of your documents?


(Russ) #3

Yes, I've tried all of the files found here:

Attempted Files (google drive)

All of the PDFs cause errors. All of the others appear to index fine, but there is never any content.


(David Pilato) #4

Thanks for sharing your files. It's really useful!

I tried your files with FSCrawler to see how it goes:

  • gh_playset_summary.pdf contains no text. Only images. As a result, I'm getting:
"\n\n"

Other documents are working well.

I'll try the same docs with ingest, hopefully tomorrow, to see how it goes. I'll update the thread.
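
Since the same files parse under FSCrawler, one thing worth ruling out is the payload itself. A minimal Python sketch of such a check follows, assuming the base64 string sent in `fileData` and the original file are both at hand. The function name is hypothetical.

```python
import base64
from pathlib import Path


def payload_matches_file(encoded, path):
    """Return True if the base64 payload decodes to exactly the file's bytes."""
    decoded = base64.b64decode(encoded)
    on_disk = Path(path).read_bytes()  # binary read, no text decoding
    return decoded == on_disk
```

If this returns False, the bytes were altered before encoding, and the parser is being handed something other than the original file.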


(Russ) #5

Thank you for your time David.

When I attempt that same file (gh_playset_summary.pdf) I get the following error:

{
  "error": {
    "root_cause": [
      {
        "type": "exception",
        "reason": "java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [_ingest._value.fileData]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5]; nested: IOException[java.util.zip.DataFormatException: invalid block type]; nested: DataFormatException[invalid block type];",
        "header": {
          "processor_type": "foreach"
        }
      }
    ],
    "type": "exception",
    "reason": "java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [_ingest._value.fileData]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5]; nested: IOException[java.util.zip.DataFormatException: invalid block type]; nested: DataFormatException[invalid block type];",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "ElasticsearchParseException[Error parsing document in field [_ingest._value.fileData]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5]; nested: IOException[java.util.zip.DataFormatException: invalid block type]; nested: DataFormatException[invalid block type];",
      "caused_by": {
        "type": "parse_exception",
        "reason": "Error parsing document in field [_ingest._value.fileData]",
        "caused_by": {
          "type": "tika_exception",
          "reason": "TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5",
          "caused_by": {
            "type": "i_o_exception",
            "reason": "java.util.zip.DataFormatException: invalid block type",
            "caused_by": {
              "type": "data_format_exception",
              "reason": "invalid block type"
            }
          }
        }
      }
    },
    "header": {
      "processor_type": "foreach"
    }
  },
  "status": 500
}

(David Pilato) #6

Yeah. I saw your initial post. I just need some time to test it with ingest-attachment now. It's on my list.

Tomorrow hopefully. :)


(Russ) #7

Hi David,

I was able to resolve my problem. It turns out I hadn't read the ingest attachment plugin documentation carefully enough.

The source field must be a base64 encoded binary.

I was reading the files in as text, not as binary, and then base64-encoding the result.

Interestingly, simple text files still worked that way, while other formats half worked or failed outright.
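
One plausible explanation for that pattern, sketched in Python: the reading code is not shown in this thread, so the sketch assumes the non-binary read decoded the bytes as UTF-8 and replaced invalid sequences. The sample byte strings are made up for illustration.

```python
def roundtrip_through_text(raw):
    """Simulate reading bytes as UTF-8 text (invalid bytes replaced) and re-encoding."""
    return raw.decode("utf-8", errors="replace").encode("utf-8")


plain_text = b"Hello, plain ASCII text.\n"
# Made-up bytes: a ZIP-style "PK" header followed by bytes that are not valid UTF-8.
zip_like = b"PK\x03\x04\x14\x00\x08\x00\xff\xfe\x9c\x80"

# ASCII text survives the round trip unchanged.
print(roundtrip_through_text(plain_text) == plain_text)
# Binary data does not: every invalid byte becomes the replacement character.
print(roundtrip_through_text(zip_like) == zip_like)
# The leading magic bytes are plain ASCII, so they still survive.
print(roundtrip_through_text(zip_like)[:4] == b"PK\x03\x04")
```

Under that assumption, a surviving header with a damaged body would be consistent with Office files being detected as "application/zip" with no content, and with compressed PDF streams failing to inflate.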

Now that I read the files in as binary and then convert them to base64, they all work perfectly.
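
A minimal sketch of that approach, assuming Python (the thread does not say which language was used) and a hypothetical helper name:

```python
import base64
from pathlib import Path


def encode_file_for_attachment(path):
    """Read a file's raw bytes and return them as a base64 ASCII string."""
    raw = Path(path).read_bytes()  # binary read: no decoding, no newline translation
    return base64.b64encode(raw).decode("ascii")
```

The returned string is what would go into the `fileData` field of each entry in `files`.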

Again, thanks for your assistance.


(David Pilato) #8

Awesome! Thanks for the follow up!

