Error while using ingest attachment plugin on some docs

Hello, I have some problems with the ingest attachment plugin. It works most of the time but fails on some documents (.docx) with this exception:

[2018-10-24T15:00:44,595][DEBUG][o.e.a.b.TransportBulkAction] [kz-el03-node2] failed to execute pipeline [attachment] for document [attachment-2018-10-18/attachments/2264af3b-7a48-4b7b-ae37-c256472de62c]
org.elasticsearch.ElasticsearchException: java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [data]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$2@56c6f26e]; nested: NullPointerException;
	at org.elasticsearch.ingest.CompoundProcessor.newCompoundProcessorException(CompoundProcessor.java:156) ~[elasticsearch-5.6.11.jar:5.6.11]
	at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:107) ~[elasticsearch-5.6.11.jar:5.6.11]
	at org.elasticsearch.ingest.Pipeline.execute(Pipeline.java:58) ~[elasticsearch-5.6.11.jar:5.6.11]
	at org.elasticsearch.ingest.PipelineExecutionService.innerExecute(PipelineExecutionService.java:167) ~[elasticsearch-5.6.11.jar:5.6.11]
	at org.elasticsearch.ingest.PipelineExecutionService.access$000(PipelineExecutionService.java:41) ~[elasticsearch-5.6.11.jar:5.6.11]
	at org.elasticsearch.ingest.PipelineExecutionService$2.doRun(PipelineExecutionService.java:88) [elasticsearch-5.6.11.jar:5.6.11]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:675) [elasticsearch-5.6.11.jar:5.6.11]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.11.jar:5.6.11]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
Caused by: java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [data]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$2@56c6f26e]; nested: NullPointerException;

The ES version is 5.6.11. I tried to open this directly in tika-app-1.18.jar and got the same error.
I can send docs if needed. Thanks.

Please provide the full stack trace, also the processor configuration and a reproducible example would be nice.

Lastly, can you verify whether this error still happens in the newest version of Elasticsearch, which is 6.4.2 at the moment?

Thanks.

Examples: google drive zip file

Stack trace: pastebin

Processor configuration: 2 x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz, heap size 31GB

Sadly, I can't verify whether this error still happens in the newest version of Elasticsearch at the moment because I can't easily upgrade to 6.4.2. I also checked these files with the newest tika-app (1.19) and it failed too, with the same error.

Judging from the stack trace, this indeed looks more like a Tika issue. Would it be possible to create an example without Elasticsearch, verify that it still happens, and then file a bug against Tika?
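A check like this could be scripted so that the failing files are picked out of a larger batch. The following is only a minimal sketch, assuming Java is on the PATH and a tika-app jar is available locally; the jar path and the function names are hypothetical. It also assumes that a parse failure shows up as a non-zero exit code or an exception trace on stderr.

```python
import subprocess


def build_tika_command(jar_path, doc_path):
    """Command line asking tika-app to extract plain text from one file."""
    return ["java", "-jar", str(jar_path), "--text", str(doc_path)]


def find_failing_docs(jar_path, doc_paths, runner=subprocess.run):
    """Return (path, stderr) pairs for documents tika-app could not parse.

    `runner` is injectable so the logic can be exercised without Java.
    """
    failures = []
    for doc in doc_paths:
        result = runner(
            build_tika_command(jar_path, doc),
            capture_output=True,
            text=True,
        )
        # Assumption: a failed parse yields a non-zero exit code
        # or an exception trace on stderr.
        if result.returncode != 0 or "Exception" in result.stderr:
            failures.append((str(doc), result.stderr))
    return failures
```

Calling `find_failing_docs("tika-app.jar", list_of_docx_paths)` would then give the subset of files worth attaching to a Tika bug report, together with the stderr output for each.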

Y. Please open an issue on Tika’s JIRA and I’ll take a look. We may be fairly close to the next POI release, and the fix is likely trivial.

How do I open an issue? I couldn't find out how to do it on https://issues.apache.org/jira/projects/TIKA/issues/

Once you create an account, log in and you’ll see a “create” issue button at the top.

I opened this on POI's bugzilla: https://bz.apache.org/bugzilla/show_bug.cgi?id=62859

Once I finish local integration tests, I'll commit the fix, which will make it into the next version of POI and Tika.

Thank you for sharing the full stacktrace!

Thank you! Do I still need to create an issue on JIRA now that you have created a bug in Bugzilla?

If you are able to share the document, I’d be interested to see what structure led to an empty sdtContent. I wasn’t able to replicate the NPE by manually editing the XML. If you can’t share, all is ok. Thank you again.
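For anyone who wants to check their own files for this structure before sharing them, here is a rough sketch using only the Python standard library. It counts `w:sdtContent` elements that have no child elements in the XML parts under `word/`. The function name is hypothetical, and whether a childless `w:sdtContent` is exactly what triggers the NPE is an assumption based on the description above.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by .docx parts.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_empty_sdt_content(docx_file):
    """Count <w:sdtContent> elements with no children in word/*.xml parts."""
    empty = 0
    with zipfile.ZipFile(docx_file) as archive:
        for name in archive.namelist():
            if not (name.startswith("word/") and name.endswith(".xml")):
                continue
            root = ET.fromstring(archive.read(name))
            for node in root.iter(f"{{{W_NS}}}sdtContent"):
                # Assumption: "empty" means no child elements at all.
                if len(node) == 0:
                    empty += 1
    return empty
```

A non-zero result for a file that fails in Tika would be a hint, not proof, that the document hits the same code path.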

I created an issue here https://issues.apache.org/jira/browse/TIKA-2769. Is it ok?

Y. Perfect...almost. See my comments there. Thank you!

To close the loop here: Иван shared docs on the Tika JIRA issue that triggered 1) an NPE and 2) a class cast exception.

I fixed the NPE in POI, and I just prevented the class cast exception for template/glossary documents.

POI doesn't currently support glossary documents, but we shouldn't throw a ClassCastException! If you do want to extract info from glossary documents, you can use Tika's beta-level SAX parser for docx files.

Please see the Tika issue for references to the issues in POI.

Both of these fixes will be available in the next version of POI and Tika. POI should be out in the next few weeks.

Thank you, again, @7ca884569f2642e2c488 and @spinscale !

