ElasticsearchParseException using Ingest Attachment Processor Plugin in Elasticsearch 6.4.2

thbaan · March 19, 2019, 10:30am

Hi there

We use the Ingest Attachment Processor Plugin in Elasticsearch 6.4.2 to index content from OCR-processed PDF files. Today a user reported an error with a specific document, which is provided here. The stack trace from Elasticsearch is enclosed below.

Can you help me work out what is causing this error and how to avoid it? I've checked the release notes for newer versions, but none of them seem to fix this.

Thank you in advance.

Here's the full stack trace from the Elasticsearch log:

[2019-03-19T09:05:53,075][DEBUG][o.e.a.b.TransportBulkAction] [ELASTICSEARCH01] failed to execute pipeline [attachment] for document [kildeviserindex_udvikling/kildeviserindexmodel/161]
org.elasticsearch.ElasticsearchException: java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [data]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@55389b84]; nested: IOException[expected number, actual=COSFloat{18446744072911454224} at offset 1182692];
	at org.elasticsearch.ingest.CompoundProcessor.newCompoundProcessorException(CompoundProcessor.java:156) ~[elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:107) ~[elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.ingest.Pipeline.execute(Pipeline.java:58) ~[elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.ingest.PipelineExecutionService.innerExecute(PipelineExecutionService.java:155) ~[elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.ingest.PipelineExecutionService.access$100(PipelineExecutionService.java:43) ~[elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.ingest.PipelineExecutionService$1.doRun(PipelineExecutionService.java:78) [elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.4.2.jar:6.4.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_191]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_191]
	at java.lang.Thread.run(Unknown Source) [?:1.8.0_191]
Caused by: java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [data]]; nested: TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@55389b84]; nested: IOException[expected number, actual=COSFloat{18446744072911454224} at offset 1182692];
	... 11 more
Caused by: org.elasticsearch.ElasticsearchParseException: Error parsing document in field [data]
	at org.elasticsearch.ingest.attachment.AttachmentProcessor.execute(AttachmentProcessor.java:106) ~[?:?]
	at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:100) ~[elasticsearch-6.4.2.jar:6.4.2]
	... 9 more
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@55389b84
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286) ~[?:?]
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) ~[?:?]
	at org.apache.tika.Tika.parseToString(Tika.java:568) ~[?:?]
	at org.elasticsearch.ingest.attachment.TikaImpl.lambda$parse$0(TikaImpl.java:108) ~[?:?]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_191]
	at org.elasticsearch.ingest.attachment.TikaImpl.parse(TikaImpl.java:107) ~[?:?]
	at org.elasticsearch.ingest.attachment.AttachmentProcessor.execute(AttachmentProcessor.java:101) ~[?:?]
	at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:100) ~[elasticsearch-6.4.2.jar:6.4.2]
	... 9 more
Caused by: java.io.IOException: expected number, actual=COSFloat{18446744072911454224} at offset 1182692
	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:166) ~[?:?]
	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryNameValuePair(BaseParser.java:279) ~[?:?]
	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:212) ~[?:?]
	at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:862) ~[?:?]
	at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:852) ~[?:?]
	at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:821) ~[?:?]
	at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:741) ~[?:?]
	at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:701) ~[?:?]
	at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:205) ~[?:?]
	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:240) ~[?:?]
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1144) ~[?:?]
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1117) ~[?:?]
	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153) ~[?:?]
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) ~[?:?]
	at org.apache.tika.Tika.parseToString(Tika.java:568) ~[?:?]
	at org.elasticsearch.ingest.attachment.TikaImpl.lambda$parse$0(TikaImpl.java:108) ~[?:?]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_191]
	at org.elasticsearch.ingest.attachment.TikaImpl.parse(TikaImpl.java:107) ~[?:?]
	at org.elasticsearch.ingest.attachment.AttachmentProcessor.execute(AttachmentProcessor.java:101) ~[?:?]
	at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:100) ~[elasticsearch-6.4.2.jar:6.4.2]
	... 9 more

dadoonet · March 19, 2019, 10:46am

That error comes from PDFBox, which is used by Tika, which in turn is used by the ingest-attachment plugin.

Could you share the document which is generating this error?
Could you also upgrade to at least 6.4.3? It contains a Tika update, which hopefully includes a PDFBox update that might fix the problem.

thbaan · March 19, 2019, 11:56am

Thanks for your quick response.

I've already provided a link to the PDF document in the original post.

We've just gone through an upgrade from 1.7.x to 6.4.2, but if it fixes the issue we will definitely look into upgrading!

dadoonet · March 19, 2019, 2:12pm

Why did you choose an old Elasticsearch version instead of the most recent one if you just upgraded?

Anyway, I just tried with the latest version of my personal project (FSCrawler) which is using the latest version of Tika (1.20) and sadly I'm getting the same issue.

BTW, you should know that ingest-attachment does not support OCR, AFAIK. So I'm not sure what you could really extract from that document.

I'd recommend opening an issue with the PDFBox project, perhaps attaching this document.
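
For anyone filing that issue, one detail in the error message may be worth including. The rejected value 18446744072911454224 is too large for a signed 64-bit integer but still below 2^64, so it looks like a negative number that was written out as an unsigned 64-bit integer. The Python sketch below only does that arithmetic. The interpretation (that the tool which produced the PDF wrapped a negative value) is an assumption and is not confirmed anywhere in this thread, and the helper name `as_signed_64` is made up for illustration.

```python
# Value reported by PDFBox in "expected number, actual=COSFloat{...}".
REPORTED = 18446744072911454224


def as_signed_64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as two's-complement signed."""
    if not 0 <= value < 2**64:
        raise ValueError("value does not fit in 64 bits")
    return value - 2**64 if value >= 2**63 else value


signed = as_signed_64(REPORTED)
print(signed)                    # -798097392
print(-2**31 <= signed < 2**31)  # True: it would fit in a signed 32-bit int
```

If that reading is right, the number at offset 1182692 was probably meant to be -798097392, which may help the PDFBox developers decide whether the parser should tolerate it.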

thbaan · March 19, 2019, 2:48pm

6.4.2 was the newest version at the time we did the upgrade.

Well - it’s extracting text from all other OCR-processed documents, so I guess you’re wrong about that.

dadoonet · March 19, 2019, 3:10pm

I'm surprised, because OCR works by calling an external binary, and I thought the security manager did not allow that.
Maybe I just have not tried it for a long time.

dadoonet · March 19, 2019, 4:31pm

When I look at the source code and the security policy, I really wonder if OCR is supposed to work...

elastic/elasticsearch/blob/master/plugins/ingest-attachment/src/main/plugin-metadata/plugin-security.policy#L21-L36

grant {
  // needed to apply additional sandboxing to tika parsing
  permission java.security.SecurityPermission "createAccessControlContext";


  // TODO: fix PDFBox not to actually install bouncy castle like this
  permission java.security.SecurityPermission "putProviderProperty.BC";
  permission java.security.SecurityPermission "insertProvider";
  // TODO: fix POI XWPF to not do this: https://bz.apache.org/bugzilla/show_bug.cgi?id=58597
  permission java.lang.reflect.ReflectPermission "suppressAccessChecks";
  // needed by xmlbeans, as part of POI for MS xml docs
  permission java.lang.RuntimePermission "getClassLoader";
  // ZipFile needs accessDeclaredMembers on Java 10
  permission java.lang.RuntimePermission "accessDeclaredMembers";
  // PDFBox checks for the existence of this class
  permission java.lang.RuntimePermission "accessClassInPackage.sun.java2d.cmm.kcms";
};

And here I can see that TesseractParser is omitted:

elastic/elasticsearch/blob/master/plugins/ingest-attachment/src/main/java/org/elasticsearch/ingest/attachment/TikaImpl.java#L78-L91

private static final Parser PARSERS[] = new Parser[] {
    // documents
    new org.apache.tika.parser.html.HtmlParser(),
    new org.apache.tika.parser.rtf.RTFParser(),
    new org.apache.tika.parser.pdf.PDFParser(),
    new org.apache.tika.parser.txt.TXTParser(),
    new org.apache.tika.parser.microsoft.OfficeParser(),
    new org.apache.tika.parser.microsoft.OldExcelParser(),
    ParserDecorator.withoutTypes(new org.apache.tika.parser.microsoft.ooxml.OOXMLParser(), EXCLUDES),
    new org.apache.tika.parser.odf.OpenDocumentParser(),
    new org.apache.tika.parser.iwork.IWorkPackageParser(),
    new org.apache.tika.parser.xml.DcXMLParser(),
    new org.apache.tika.parser.epub.EpubParser(),
};

So I tried the following code.

https://gist.github.com/dadoonet/dc148fbfe19a9a4ee56d667efdcc6a6f

ingest.kibana

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    },{
      "remove": {

This file has been truncated.

It contains a PDF document which has some text (green) and an image containing some text (yellow background).

After running this, I can see that only the text is extracted, which means that OCR is not performed:

{
  "_index" : "my_index",
  "_type" : "_doc",
  "_id" : "my_id",
  "_version" : 5,
  "_seq_no" : 4,
  "_primary_term" : 2,
  "found" : true,
  "_source" : {
    "attachment" : {
      "date" : "2019-03-02T13:42:36Z",
      "keywords" : "keyword1, keyword2",
      "content_type" : "application/pdf",
      "author" : "David Pilato",
      "language" : "en",
      "title" : "Test Tika title",
      "content" : """
This file also contains text. 

  



This second part of the text is in Page 2
""",
      "content_length" : 93
    }
  }
}

Could you check on your platform whether you get a different result than mine? As far as I can recall, OCR was removed a long time ago.
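
One way to run that check without indexing anything is the ingest simulate API, which runs a document through an existing pipeline and returns the result. The Python sketch below only builds the request path and JSON body and does not send anything. The pipeline name `attachment` and the field `data` come from this thread. The function name `build_simulate_request` and the placeholder bytes are made up for illustration, and it is an assumption that `_source` alone is enough for each entry in `docs`.

```python
import base64
import json


def build_simulate_request(pdf_bytes: bytes, pipeline: str = "attachment"):
    """Return (path, JSON body) for a call to the ingest simulate API."""
    body = {
        "docs": [
            # The attachment processor expects base64 in the configured field.
            {"_source": {"data": base64.b64encode(pdf_bytes).decode("ascii")}}
        ]
    }
    return f"/_ingest/pipeline/{pipeline}/_simulate", json.dumps(body)


# Placeholder bytes; in practice read the real file with open(path, "rb").
path, body = build_simulate_request(b"%PDF-1.4 placeholder")
print("POST", path)
print(body)
```

The printed path and body can be pasted into the Kibana console or sent with any HTTP client. If the returned `attachment.content` contains the text from the image, OCR was performed.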

Thanks!

thbaan · March 20, 2019, 2:12pm

Oh - I'm not claiming that ingest-attachment performs the OCR itself. I'm just saying that the document has been OCR-processed (with another tool, which adds text layers to the PDF document).

dadoonet · March 20, 2019, 2:31pm

Ha ok!
