ElasticsearchParseException using Ingest Attachment Processor Plugin in Elasticsearch 6.4.2


(Thomas) #1

Hi there

We use the Ingest Attachment Processor Plugin in Elasticsearch 6.4.2 to index content from OCR-processed PDF files. Today a user reported an error with a specific document, which is provided here. The stack trace from Elasticsearch is enclosed below.

Can you help me find out what's causing this error and, if possible, how to avoid it? I've checked the release notes for newer versions, but none of them seem to fix this.

Thank you in advance.

Here's the full stack trace from the Elasticsearch log:

[2019-03-19T09:05:53,075][DEBUG][o.e.a.b.TransportBulkAction] [ELASTICSEARCH01] failed to execute pipeline [attachment] for document [kildeviserindex_udvikling/kildeviserindexmodel/161]
org.elasticsearch.ElasticsearchException: java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [data]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@55389b84]; nested: IOException[expected number, actual=COSFloat{18446744072911454224} at offset 1182692];
	at org.elasticsearch.ingest.CompoundProcessor.newCompoundProcessorException(CompoundProcessor.java:156) ~[elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:107) ~[elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.ingest.Pipeline.execute(Pipeline.java:58) ~[elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.ingest.PipelineExecutionService.innerExecute(PipelineExecutionService.java:155) ~[elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.ingest.PipelineExecutionService.access$100(PipelineExecutionService.java:43) ~[elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.ingest.PipelineExecutionService$1.doRun(PipelineExecutionService.java:78) [elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.4.2.jar:6.4.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_191]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_191]
	at java.lang.Thread.run(Unknown Source) [?:1.8.0_191]
Caused by: java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [data]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@55389b84]; nested: IOException[expected number, actual=COSFloat{18446744072911454224} at offset 1182692];
	... 11 more
Caused by: org.elasticsearch.ElasticsearchParseException: Error parsing document in field [data]
	at org.elasticsearch.ingest.attachment.AttachmentProcessor.execute(AttachmentProcessor.java:106) ~[?:?]
	at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:100) ~[elasticsearch-6.4.2.jar:6.4.2]
	... 9 more
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@55389b84
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286) ~[?:?]
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) ~[?:?]
	at org.apache.tika.Tika.parseToString(Tika.java:568) ~[?:?]
	at org.elasticsearch.ingest.attachment.TikaImpl.lambda$parse$0(TikaImpl.java:108) ~[?:?]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_191]
	at org.elasticsearch.ingest.attachment.TikaImpl.parse(TikaImpl.java:107) ~[?:?]
	at org.elasticsearch.ingest.attachment.AttachmentProcessor.execute(AttachmentProcessor.java:101) ~[?:?]
	at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:100) ~[elasticsearch-6.4.2.jar:6.4.2]
	... 9 more
Caused by: java.io.IOException: expected number, actual=COSFloat{18446744072911454224} at offset 1182692
	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:166) ~[?:?]
	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryNameValuePair(BaseParser.java:279) ~[?:?]
	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:212) ~[?:?]
	at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:862) ~[?:?]
	at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:852) ~[?:?]
	at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:821) ~[?:?]
	at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:741) ~[?:?]
	at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:701) ~[?:?]
	at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:205) ~[?:?]
	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:240) ~[?:?]
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1144) ~[?:?]
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1117) ~[?:?]
	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153) ~[?:?]
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) ~[?:?]
	at org.apache.tika.Tika.parseToString(Tika.java:568) ~[?:?]
	at org.elasticsearch.ingest.attachment.TikaImpl.lambda$parse$0(TikaImpl.java:108) ~[?:?]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_191]
	at org.elasticsearch.ingest.attachment.TikaImpl.parse(TikaImpl.java:107) ~[?:?]
	at org.elasticsearch.ingest.attachment.AttachmentProcessor.execute(AttachmentProcessor.java:101) ~[?:?]
	at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:100) ~[elasticsearch-6.4.2.jar:6.4.2]
	... 9 more

(David Pilato) #2

That's an error coming from PDFBox, which is used by Tika, which in turn is used by the ingest-attachment plugin.
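
Because the failure originates below the plugin, one way to keep a single unparsable PDF from failing its bulk item is to give the attachment processor an `on_failure` handler. The following is only a minimal sketch, not a fix for the parsing error itself: it builds a pipeline body that could be sent to `PUT _ingest/pipeline/attachment`. The `data` field and the `attachment` pipeline name come from the log above; the `attachment_error` field name and the helper function are hypothetical.

```python
import json


def build_attachment_pipeline(source_field="data", error_field="attachment_error"):
    """Build an ingest pipeline body whose attachment processor records
    parse failures in `error_field` instead of rejecting the document.

    `error_field` is a made-up name for illustration only.
    """
    return {
        "description": "Extract attachment text; keep the document if parsing fails",
        "processors": [
            {
                "attachment": {
                    "field": source_field,
                    # Runs only when the attachment processor throws,
                    # e.g. on a TikaException caused by a malformed PDF.
                    "on_failure": [
                        {
                            "set": {
                                "field": error_field,
                                "value": "{{ _ingest.on_failure_message }}",
                            }
                        }
                    ],
                }
            }
        ],
    }


if __name__ == "__main__":
    # Print the JSON body; sending it to the cluster is left to the reader.
    print(json.dumps(build_attachment_pipeline(), indent=2))
```

With a handler like this the document would still be indexed, without an extracted `attachment` object, and the error message would be searchable in the hypothetical `attachment_error` field.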

Could you share the document which is generating this error?
Could you also upgrade to at least 6.4.3? It contains a Tika update, which hopefully includes a PDFBox update that might fix the problem. See


(Thomas) #3

Thanks for your quick response.

I've already provided a link to the PDF document in the original post :slight_smile:

We've just gone through an upgrade from 1.7.x to 6.4.2, but if it will fix the issue we will definitely look into an upgrade!


(David Pilato) #4

Why did you choose an old Elasticsearch version instead of the most recent one if you just upgraded?

Anyway, I just tried with the latest version of my personal project (FSCrawler) which is using the latest version of Tika (1.20) and sadly I'm getting the same issue.

BTW, you should know that ingest-attachment does not support OCR, AFAIK. So I'm not sure what you could really extract from that document. :slight_smile:

I'd recommend opening an issue with the PDFBox project, maybe attaching this document.
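
One detail that might be worth including in such an issue is the number in the error message, `18446744072911454224`. It is larger than the biggest signed 64-bit integer, and it is exactly what a negative 64-bit value looks like when printed as unsigned. The sketch below only does that arithmetic. It is a guess, not a confirmed diagnosis, that the tool that wrote the PDF serialized a negative number this way and that this is why the value is reported as a `COSFloat`. The helper name is hypothetical.

```python
INT64_MAX = 2**63 - 1

# Value copied from the IOException message in the stack trace.
reported = 18446744072911454224


def as_signed_64(value):
    """Reinterpret an unsigned 64-bit value as a two's-complement signed one."""
    if not 0 <= value < 2**64:
        raise ValueError("value does not fit in 64 bits")
    return value - 2**64 if value > INT64_MAX else value


if __name__ == "__main__":
    print(reported > INT64_MAX)    # True: too large for a signed 64-bit integer
    print(as_signed_64(reported))  # -798097392
```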


(Thomas) #5

6.4.2 was the newest version at the time we did the upgrade.

Well - it’s extracting text from all other OCR-processed documents, so I guess you’re wrong about that fact :slight_smile:


(David Pilato) #6

I'm surprised, because OCR works by calling a binary, and I thought that the security manager did not allow that.
Maybe I have not tried it for a long time. :slight_smile:


(David Pilato) #7

When I look at the source code and the security policy, I really wonder if OCR is supposed to work...

And here I can see that TesseractParser is omitted:

So I tried the following code.

It contains a PDF document which has some text (green) and an image containing some text (yellow background):

After running this, I can see that only the text is extracted, which means that OCR is not performed:

{
  "_index" : "my_index",
  "_type" : "_doc",
  "_id" : "my_id",
  "_version" : 5,
  "_seq_no" : 4,
  "_primary_term" : 2,
  "found" : true,
  "_source" : {
    "attachment" : {
      "date" : "2019-03-02T13:42:36Z",
      "keywords" : "keyword1, keyword2",
      "content_type" : "application/pdf",
      "author" : "David Pilato",
      "language" : "en",
      "title" : "Test Tika title",
      "content" : """
This file also contains text. 

  



This second part of the text is in Page 2
""",
      "content_length" : 93
    }
  }
}

Could you check on your platform whether you get a different result than mine? As far as I can recall, OCR was removed a long time ago.

Thanks!


(Thomas) #8

Oh - I'm not claiming that ingest-attachment will perform the OCR process. I'm just saying that the document has been OCR-processed (with another tool, which adds text layers to the PDF document).


(David Pilato) #9

Ha ok!

