I'm having multiple problems with the ingest attachment plugin.
- I'm on ES version 5.1.1
- The ingest attachment plugin has been installed
- I've created my pipeline processor(s)
- I have successfully ingested some simple plain-text documents
The trouble starts when I try more complex document types.
I've tried several Office formats: pptx, ppt, docx, etc.
With these, the index request succeeds and the attachments appear to be indexed, but the extracted content is always empty:
"attachment": {
  "content_type": "application/zip",
  "content_length": 0
}
Interestingly, most of them report a content_type of "application/zip" rather than an MS Office content type.
After seeing this I moved on to PDF documents, and ran into indexing failures.
I tried several PDFs from different sources, each created in a different way, and each one produced a different parse error from Tika.
The simplest PDF (plain text converted from a Word document) results in the following error:
{
  "error": {
    "root_cause": [
      {
        "type": "exception",
        "reason": "java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [_ingest._value.fileData]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5]; nested: IOException[java.util.zip.DataFormatException: invalid block type]; nested: DataFormatException[invalid block type];",
        "header": {
          "processor_type": "foreach"
        }
      }
    ],
    "type": "exception",
    "reason": "java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [_ingest._value.fileData]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5]; nested: IOException[java.util.zip.DataFormatException: invalid block type]; nested: DataFormatException[invalid block type];",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "ElasticsearchParseException[Error parsing document in field [_ingest._value.fileData]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5]; nested: IOException[java.util.zip.DataFormatException: invalid block type]; nested: DataFormatException[invalid block type];",
      "caused_by": {
        "type": "parse_exception",
        "reason": "Error parsing document in field [_ingest._value.fileData]",
        "caused_by": {
          "type": "tika_exception",
          "reason": "TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5",
          "caused_by": {
            "type": "i_o_exception",
            "reason": "java.util.zip.DataFormatException: invalid block type",
            "caused_by": {
              "type": "data_format_exception",
              "reason": "invalid block type"
            }
          }
        }
      }
    },
    "header": {
      "processor_type": "foreach"
    }
  },
  "status": 500
}
A larger, more complex PDF gives me this error:
{
  "error": {
    "root_cause": [
      {
        "type": "exception",
        "reason": "java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [_ingest._value.fileData]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5]; nested: IOException[Missing root object specification in trailer.];",
        "header": {
          "processor_type": "foreach"
        }
      }
    ],
    "type": "exception",
    "reason": "java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [_ingest._value.fileData]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5]; nested: IOException[Missing root object specification in trailer.];",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "ElasticsearchParseException[Error parsing document in field [_ingest._value.fileData]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5]; nested: IOException[Missing root object specification in trailer.];",
      "caused_by": {
        "type": "parse_exception",
        "reason": "Error parsing document in field [_ingest._value.fileData]",
        "caused_by": {
          "type": "tika_exception",
          "reason": "TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@51425bb5",
          "caused_by": {
            "type": "i_o_exception",
            "reason": "Missing root object specification in trailer."
          }
        }
      }
    },
    "header": {
      "processor_type": "foreach"
    }
  },
  "status": 500
}
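One thing worth ruling out for both the Office and PDF cases is whether the bytes in fileData still match the original files after encoding. pptx and docx files are ZIP containers, so a detected type of "application/zip" is at least consistent with bytes that start like a ZIP file. The sketch below decodes a base64 string locally and reports which well-known file signature it starts with. The function name `sniff_payload` and the signature table are illustrative assumptions, not part of Elasticsearch or Tika.

```python
import base64

# Well-known leading bytes ("magic numbers") for the formats discussed here.
SIGNATURES = {
    b"%PDF-": "pdf",
    b"PK\x03\x04": "zip container (docx, pptx, xlsx)",
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "OLE2 container (doc, ppt, xls)",
}


def sniff_payload(file_data):
    """Decode a base64 string and report which known signature it starts with.

    Returns a (label, decoded_length) tuple. Raises binascii.Error if the
    string contains characters outside the base64 alphabet.
    """
    raw = base64.b64decode(file_data, validate=True)
    for magic, label in SIGNATURES.items():
        if raw.startswith(magic):
            return label, len(raw)
    return "unknown", len(raw)


if __name__ == "__main__":
    sample = base64.b64encode(b"%PDF-1.4 test").decode("ascii")
    print(sniff_payload(sample))
```

If the decoded length differs from the size of the file on disk, or a PDF does not come back as "pdf", the payload was probably altered before it reached the pipeline.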
Here is my attachment pipeline:
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information from arrays",
  "processors" : [
    {
      "foreach": {
        "field": "files",
        "processor": {
          "attachment": {
            "target_field": "_ingest._value.attachment",
            "field": "_ingest._value.fileData",
            "ignore_failure" : true
          }
        }
      }
    },
    {
      "foreach": {
        "field": "files",
        "processor": {
          "remove": {
            "field": "_ingest._value.fileData"
          }
        }
      }
    }
  ]
}
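For context, the attachment processor reads each fileData value as the base64 encoding of the raw file bytes. A minimal sketch of how a request body for this pipeline could be assembled follows. The helper name `build_document`, the extra `filename` key, and the sample bytes are hypothetical and are not taken from the actual client code.

```python
import base64
import json


def build_document(files):
    """Build an index request body shaped for the `attachment` pipeline.

    `files` maps a filename to the raw bytes of that file.
    """
    return {
        "files": [
            {
                # Assumed extra field for illustration; the pipeline ignores it.
                "filename": name,
                # The attachment processor expects base64-encoded binary data.
                "fileData": base64.b64encode(raw).decode("ascii"),
            }
            for name, raw in files.items()
        ]
    }


if __name__ == "__main__":
    body = build_document({"sample.txt": b"hello attachment"})
    print(json.dumps(body, indent=2))
```

The resulting body would then be indexed with the pipeline query parameter, for example `PUT my_index/my_type/1?pipeline=attachment`, where the index and type names are placeholders.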
Am I running up against a Tika configuration issue, or are there other configurations in ES that I need to look at?