Hello, I have some problems with the ingest attachment plugin. It works most of the time but fails on some .docx documents with the following exception:
[2018-10-24T15:00:44,595][DEBUG][o.e.a.b.TransportBulkAction] [kz-el03-node2] failed to execute pipeline [attachment] for document [attachment-2018-10-18/attachments/2264af3b-7a48-4b7b-ae37-c256472de62c]
org.elasticsearch.ElasticsearchException: java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [data]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$2@56c6f26e]; nested: NullPointerException;
at org.elasticsearch.ingest.CompoundProcessor.newCompoundProcessorException(CompoundProcessor.java:156) ~[elasticsearch-5.6.11.jar:5.6.11]
at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:107) ~[elasticsearch-5.6.11.jar:5.6.11]
at org.elasticsearch.ingest.Pipeline.execute(Pipeline.java:58) ~[elasticsearch-5.6.11.jar:5.6.11]
at org.elasticsearch.ingest.PipelineExecutionService.innerExecute(PipelineExecutionService.java:167) ~[elasticsearch-5.6.11.jar:5.6.11]
at org.elasticsearch.ingest.PipelineExecutionService.access$000(PipelineExecutionService.java:41) ~[elasticsearch-5.6.11.jar:5.6.11]
at org.elasticsearch.ingest.PipelineExecutionService$2.doRun(PipelineExecutionService.java:88) [elasticsearch-5.6.11.jar:5.6.11]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:675) [elasticsearch-5.6.11.jar:5.6.11]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.11.jar:5.6.11]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
Caused by: java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [data]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$2@56c6f26e]; nested: NullPointerException;
The ES version is 5.6.11. I tried opening one of these documents directly in tika-app-18.jar and got the same error.
I can send docs if needed. Thanks.
Processor configuration: 2 x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz, heap size 31GB
Sadly, I can't verify whether this error still happens in the newest version of Elasticsearch at the moment, because I can't easily upgrade to version 6.4.2. I also checked these files with the newest tika-app (1.19), and it failed with the same error.
This indeed looks more like a Tika issue, judging from the stack trace. Would it be possible to create an example without Elasticsearch, verify that the error still happens, and then file a bug against Tika?
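One way to check this outside Elasticsearch is to run the failing file through the standalone tika-app jar and look at what it prints to stderr. Below is a minimal sketch, assuming `java` is on the PATH and the jar has been downloaded locally; the function names, the jar path, and the timeout are hypothetical, and `--text` is the tika-app option for plain-text output.

```python
import subprocess


def build_tika_command(jar_path, doc_path):
    """Build the command line for extracting plain text with tika-app.

    jar_path and doc_path are placeholders for local files (assumption).
    """
    return ["java", "-jar", jar_path, "--text", doc_path]


def run_tika(jar_path, doc_path, timeout=120):
    """Run tika-app on one document and return (exit code, stderr text).

    A non-zero exit code or a stack trace in stderr suggests the failure
    comes from Tika itself rather than from the Elasticsearch pipeline.
    """
    result = subprocess.run(
        build_tika_command(jar_path, doc_path),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=timeout,
    )
    return result.returncode, result.stderr
```

Calling `run_tika("tika-app-1.19.jar", "failing.docx")` with the real file names and searching the returned stderr for `NullPointerException` should give a stack trace suitable for a Tika bug report.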
If you are able to share the document, I'd be interested to see what structure led to an empty sdtcontent. I wasn't able to replicate the NPE by manually editing the XML. If you can't share it, that's fine. Thank you again.
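If the files can't be shared, a quick local check may still help narrow this down. The sketch below counts structured document tag content elements that have no children. It assumes the "empty sdtcontent" refers to `w:sdtContent` elements in `word/document.xml`; that part name and the function name are assumptions, and headers, footers, or glossary parts could be checked the same way by passing a different `part`.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by .docx document parts.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_empty_sdt_content(docx_bytes, part="word/document.xml"):
    """Count w:sdtContent elements with no child elements in one part.

    docx_bytes is the raw content of a .docx file (a zip archive).
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read(part))
    tag = "{%s}sdtContent" % W_NS
    # len(element) is the number of direct children in ElementTree.
    return sum(1 for element in root.iter(tag) if len(element) == 0)
```

Reading a failing file with `open(path, "rb").read()` and passing the bytes to `count_empty_sdt_content` gives a non-zero count if the document contains such empty elements, which would be worth mentioning in the bug report.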
To close the loop here: Иван shared docs on the Tika JIRA issue that triggered 1) an NPE and 2) a ClassCastException.
I fixed the NPE in POI, and I just prevented the class cast exception for template/glossary documents.
POI doesn't currently support glossary documents, but we shouldn't throw a ClassCastException! If you do want to extract info from glossary documents, you can use Tika's beta-level SAX parser for docx files.
Please see the Tika issue for references to the issues in POI.
Both of these fixes will be available in the next version of POI and Tika. POI should be out in the next few weeks.