Error while using ingest attachment plugin on some docs


(Иван Сорокин) #1

Hello, I have a problem with the ingest attachment plugin. It works most of the time but fails on some documents (.docx) with this exception:

[2018-10-24T15:00:44,595][DEBUG][o.e.a.b.TransportBulkAction] [kz-el03-node2] failed to execute pipeline [attachment] for document [attachment-2018-10-18/attachments/2264af3b-7a48-4b7b-ae37-c256472de62c]
org.elasticsearch.ElasticsearchException: java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [data]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$2@56c6f26e]; nested: NullPointerException;
	at org.elasticsearch.ingest.CompoundProcessor.newCompoundProcessorException(CompoundProcessor.java:156) ~[elasticsearch-5.6.11.jar:5.6.11]
	at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:107) ~[elasticsearch-5.6.11.jar:5.6.11]
	at org.elasticsearch.ingest.Pipeline.execute(Pipeline.java:58) ~[elasticsearch-5.6.11.jar:5.6.11]
	at org.elasticsearch.ingest.PipelineExecutionService.innerExecute(PipelineExecutionService.java:167) ~[elasticsearch-5.6.11.jar:5.6.11]
	at org.elasticsearch.ingest.PipelineExecutionService.access$000(PipelineExecutionService.java:41) ~[elasticsearch-5.6.11.jar:5.6.11]
	at org.elasticsearch.ingest.PipelineExecutionService$2.doRun(PipelineExecutionService.java:88) [elasticsearch-5.6.11.jar:5.6.11]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:675) [elasticsearch-5.6.11.jar:5.6.11]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.11.jar:5.6.11]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
Caused by: java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [data]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$2@56c6f26e]; nested: NullPointerException;

The ES version is 5.6.11. I tried to open one of these files directly in tika-app-18.jar and got the same error.
I can send the docs if needed. Thanks.


(Alexander Reelsen) #2

Please provide the full stack trace. The processor configuration and a reproducible example would also be nice.

Lastly, can you verify whether this error still happens in the newest version of Elasticsearch, which is currently 6.4.2?

Thanks.
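
For readers wondering what a reproducible example could look like here: a minimal sketch, assuming the pipeline consists of a single attachment processor reading the `data` field (the field named in the error message). It builds a request body for the ingest simulate API, so the failing file can be tested without indexing anything. The helper name `build_simulate_body` and the pipeline description are made up for illustration; the real pipeline definition was not posted in this thread.

```python
import base64
import json
import sys


def build_simulate_body(doc_bytes, field="data"):
    """Build a body for POST /_ingest/pipeline/_simulate.

    Assumption: the pipeline is one attachment processor reading `field`.
    """
    # The attachment processor expects the file content base64-encoded.
    encoded = base64.b64encode(doc_bytes).decode("ascii")
    return {
        "pipeline": {
            "description": "attachment reproduction (illustrative)",
            "processors": [{"attachment": {"field": field}}],
        },
        "docs": [{"_source": {field: encoded}}],
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Print the JSON body for the file given on the command line.
    with open(sys.argv[1], "rb") as handle:
        print(json.dumps(build_simulate_body(handle.read())))
```

The printed JSON can be sent to the `_ingest/pipeline/_simulate` endpoint of any node that has the plugin installed, and the response should contain either the extracted attachment fields or the parse error.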


(Иван Сорокин) #3

Examples: google drive zip file

Stack trace: pastebin

Processor configuration: 2 x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz, heap size 31GB

Sadly, I can't verify whether this error still happens in the newest version of Elasticsearch at the moment, because I can't easily upgrade to version 6.4.2. I also checked these files with the newest tika-app (1.19), and it failed with the same error.


(Alexander Reelsen) #4

Judging from the stack trace, this indeed looks more like a Tika issue. Would it be possible to create an example without Elasticsearch, verify that the error still happens, and then file a bug against Tika?
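
A standalone check without Elasticsearch could be sketched like this, assuming `java` is on the PATH and a tika-app jar has been downloaded locally. The jar and document paths are placeholders, and the helper names are hypothetical. `--text` is tika-app's plain-text output option.

```python
import subprocess


def tika_command(jar_path, doc_path):
    """Command line that asks tika-app to extract plain text from one file."""
    return ["java", "-jar", jar_path, "--text", doc_path]


def run_tika(jar_path, doc_path):
    """Run tika-app and return (exit code, stderr) for inspection."""
    result = subprocess.run(
        tika_command(jar_path, doc_path),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.returncode, result.stderr


def mentions_npe(stderr_text):
    """True if the captured stack trace names a NullPointerException."""
    return "NullPointerException" in stderr_text
```

Calling `run_tika("tika-app-1.19.jar", "failing.docx")` and passing the returned stderr to `mentions_npe` shows whether the same exception appears outside Elasticsearch. That stack trace is what a Tika bug report would need.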


(Tim Allison) #5

Yes. Please open an issue on Tika’s JIRA and I’ll take a look. We may be fairly close to the next POI release, and the fix is likely trivial.


(Иван Сорокин) #6

How do I open an issue? I couldn't find a way to do it on https://issues.apache.org/jira/projects/TIKA/issues/


(Tim Allison) #7

Once you create an account, log in and you’ll see a “Create” issue button at the top.


(Tim Allison) #8

I opened this on POI's bugzilla: https://bz.apache.org/bugzilla/show_bug.cgi?id=62859

Once I finish local integration tests, I'll commit the fix, which will make it into the next version of POI and Tika.

Thank you for sharing the full stacktrace!


(Иван Сорокин) #9

Thank you! Do I still need to create an issue on JIRA now that you have created a bug in Bugzilla?


(Tim Allison) #10

If you are able to share the document, I’d be interested to see what structure led to an empty sdtContent. I wasn’t able to replicate the NPE by manually editing the XML. If you can’t share, all is ok. Thank you again.
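
For anyone who cannot share a document but wants to look for this structure locally, here is a rough sketch that lists structured document tags (`w:sdt`) whose `w:sdtContent` child is missing or has no children. It only inspects the main and glossary document parts. Whether either pattern is what actually triggered the NPE is an assumption, not something confirmed in this thread.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by .docx document parts.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
PARTS = ("word/document.xml", "word/glossary/document.xml")


def suspicious_sdts(docx_file):
    """Return (part, reason) pairs for sdt elements with no usable content."""
    findings = []
    with zipfile.ZipFile(docx_file) as archive:
        names = set(archive.namelist())
        for part in PARTS:
            if part not in names:
                continue  # e.g. no glossary part in this file
            root = ET.fromstring(archive.read(part))
            for sdt in root.iter("{%s}sdt" % W):
                content = sdt.find("{%s}sdtContent" % W)
                if content is None:
                    findings.append((part, "missing sdtContent"))
                elif len(content) == 0:
                    findings.append((part, "empty sdtContent"))
    return findings
```

`suspicious_sdts("failing.docx")` returns an empty list when every `w:sdt` has a populated `w:sdtContent`; the file name is a placeholder.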


(Иван Сорокин) #11

I created an issue here https://issues.apache.org/jira/browse/TIKA-2769. Is it ok?


(Tim Allison) #12

Yes. Perfect... almost. See my comments there. Thank you!


(Tim Allison) #13

To close the loop here: Иван shared docs on the Tika JIRA issue that triggered 1) an NPE and 2) a class cast exception.

I fixed the NPE in POI, and I just prevented the class cast exception for template/glossary documents.

POI doesn't currently support glossary documents, but we shouldn't throw a ClassCastException! If you do want to extract info from glossary documents, you can use Tika's beta-level SAX parser for docx files.

Please see the Tika issue for references to the issues in POI.

Both of these fixes will be available in the next version of POI and Tika. POI should be out in the next few weeks.

Thank you, again, @7ca884569f2642e2c488 and @spinscale !

