Hi there,
We use the Ingest Attachment Processor Plugin in Elasticsearch 6.4.2 to index content from OCR-processed PDF files. Today a user reported an error with a specific document, which is provided here. The stack trace from Elasticsearch is included below.
Can you help me work out what is causing this error and, if possible, how to avoid it? I've checked the release notes for newer versions, but none of them seem to address this.
Thank you in advance.
Here's the full stack trace from the Elasticsearch log:
[2019-03-19T09:05:53,075][DEBUG][o.e.a.b.TransportBulkAction] [ELASTICSEARCH01] failed to execute pipeline [attachment] for document [kildeviserindex_udvikling/kildeviserindexmodel/161]
org.elasticsearch.ElasticsearchException: java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [data]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@55389b84]; nested: IOException[expected number, actual=COSFloat{18446744072911454224} at offset 1182692];
at org.elasticsearch.ingest.CompoundProcessor.newCompoundProcessorException(CompoundProcessor.java:156) ~[elasticsearch-6.4.2.jar:6.4.2]
at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:107) ~[elasticsearch-6.4.2.jar:6.4.2]
at org.elasticsearch.ingest.Pipeline.execute(Pipeline.java:58) ~[elasticsearch-6.4.2.jar:6.4.2]
at org.elasticsearch.ingest.PipelineExecutionService.innerExecute(PipelineExecutionService.java:155) ~[elasticsearch-6.4.2.jar:6.4.2]
at org.elasticsearch.ingest.PipelineExecutionService.access$100(PipelineExecutionService.java:43) ~[elasticsearch-6.4.2.jar:6.4.2]
at org.elasticsearch.ingest.PipelineExecutionService$1.doRun(PipelineExecutionService.java:78) [elasticsearch-6.4.2.jar:6.4.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [elasticsearch-6.4.2.jar:6.4.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.4.2.jar:6.4.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_191]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_191]
at java.lang.Thread.run(Unknown Source) [?:1.8.0_191]
Caused by: java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [data]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@55389b84]; nested: IOException[expected number, actual=COSFloat{18446744072911454224} at offset 1182692];
... 11 more
Caused by: org.elasticsearch.ElasticsearchParseException: Error parsing document in field [data]
at org.elasticsearch.ingest.attachment.AttachmentProcessor.execute(AttachmentProcessor.java:106) ~[?:?]
at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:100) ~[elasticsearch-6.4.2.jar:6.4.2]
... 9 more
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@55389b84
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286) ~[?:?]
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) ~[?:?]
at org.apache.tika.Tika.parseToString(Tika.java:568) ~[?:?]
at org.elasticsearch.ingest.attachment.TikaImpl.lambda$parse$0(TikaImpl.java:108) ~[?:?]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_191]
at org.elasticsearch.ingest.attachment.TikaImpl.parse(TikaImpl.java:107) ~[?:?]
at org.elasticsearch.ingest.attachment.AttachmentProcessor.execute(AttachmentProcessor.java:101) ~[?:?]
at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:100) ~[elasticsearch-6.4.2.jar:6.4.2]
... 9 more
Caused by: java.io.IOException: expected number, actual=COSFloat{18446744072911454224} at offset 1182692
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:166) ~[?:?]
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryNameValuePair(BaseParser.java:279) ~[?:?]
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:212) ~[?:?]
at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:862) ~[?:?]
at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:852) ~[?:?]
at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:821) ~[?:?]
at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:741) ~[?:?]
at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:701) ~[?:?]
at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:205) ~[?:?]
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:240) ~[?:?]
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1144) ~[?:?]
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1117) ~[?:?]
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153) ~[?:?]
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) ~[?:?]
at org.apache.tika.Tika.parseToString(Tika.java:568) ~[?:?]
at org.elasticsearch.ingest.attachment.TikaImpl.lambda$parse$0(TikaImpl.java:108) ~[?:?]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_191]
at org.elasticsearch.ingest.attachment.TikaImpl.parse(TikaImpl.java:107) ~[?:?]
at org.elasticsearch.ingest.attachment.AttachmentProcessor.execute(AttachmentProcessor.java:101) ~[?:?]
at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:100) ~[elasticsearch-6.4.2.jar:6.4.2]
... 9 more
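
One observation about the number in the innermost exception, offered as a guess rather than a diagnosis: 18446744072911454224 is larger than the maximum signed 64-bit value, and reinterpreting it as a two's-complement 64-bit integer gives a small negative number. That would be consistent with the PDF containing a negative integer that was written out as an unsigned 64-bit value, and with the parser treating the oversized literal as a float (hence COSFloat) where it expected an integer. Whether that is really what happened in this file is an assumption. The helper name as_signed_64 below is made up for illustration, and the sketch only checks the arithmetic:

```python
# Value reported in the IOException message in the stack trace above.
reported = 18446744072911454224

# Largest value a signed 64-bit integer (a Java long) can hold.
INT64_MAX = 2**63 - 1


def as_signed_64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a two's-complement signed one.

    Hypothetical helper for illustration; not part of Elasticsearch, Tika or PDFBox.
    """
    if not 0 <= value < 2**64:
        raise ValueError("value does not fit in 64 bits")
    return value - 2**64 if value > INT64_MAX else value


print(reported > INT64_MAX)    # True: the literal does not fit in a signed 64-bit integer
print(as_signed_64(reported))  # -798097392
```

If this reading is right, the byte offset in the message (1182692) is where the oversized literal should be visible when the PDF is opened in a hex or text editor.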