Attachment parsing causes error for .doc files


(js) #1

Hello,
Tika (or rather POI underneath it) throws an error when it parses MS Word documents that reference images which are not available. At least that is what the exception suggests.
This is what elasticsearch throws:

org.elasticsearch.index.mapper.MapperParsingException: Failed to extract text for [null]
at org.elasticsearch.index.mapper.xcontent.AttachmentMapper.parse(AttachmentMapper.java:256)
at org.elasticsearch.index.mapper.xcontent.ObjectMapper.serializeValue(ObjectMapper.java:397)
at org.elasticsearch.index.mapper.xcontent.ObjectMapper.parse(ObjectMapper.java:309)
at org.elasticsearch.index.mapper.xcontent.ObjectMapper.serializeObject(ObjectMapper.java:330)
at org.elasticsearch.index.mapper.xcontent.ObjectMapper.parse(ObjectMapper.java:301)
at org.elasticsearch.index.mapper.xcontent.XContentDocumentMapper.parse(XContentDocumentMapper.java:429)
at org.elasticsearch.index.mapper.xcontent.XContentDocumentMapper.parse(XContentDocumentMapper.java:363)
at org.elasticsearch.index.shard.service.InternalIndexShard.prepareCreate(InternalIndexShard.java:250)
at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:187)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:418)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.access$100(TransportShardReplicationOperationAction.java:233)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:331)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@5521f4ef
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
at org.apache.tika.Tika.parseToString(Tika.java:357)
at org.elasticsearch.index.mapper.xcontent.AttachmentMapper.parse(AttachmentMapper.java:254)
... 14 more
Caused by: java.lang.NegativeArraySizeException
at org.apache.poi.hwpf.usermodel.Picture.fillRawImageContent(Picture.java:314)
at org.apache.poi.hwpf.usermodel.Picture.getRawContent(Picture.java:174)
at org.apache.poi.hwpf.usermodel.Picture.fillImageContent(Picture.java:320)
at org.apache.poi.hwpf.usermodel.Picture.suggestFileExtension(Picture.java:263)
at org.apache.poi.hwpf.usermodel.Picture.suggestFileExtension(Picture.java:210)
at org.apache.poi.hwpf.usermodel.Picture.suggestFullFileName(Picture.java:137)
at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:436)
at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:420)
at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:75)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:182)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
... 18 more

I then tried the following three lines of code, calling POI directly:

// WordExtractor is org.apache.poi.hwpf.extractor.WordExtractor; FileUtils is from commons-io
FileInputStream openInputStream = FileUtils.openInputStream(new File(filename));
WordExtractor extractor = new WordExtractor(openInputStream);
String text = extractor.getText();
openInputStream.close();

With that I get a valid text extraction, even though the images are still not accessible.

So now I am wondering whether we can somehow switch the way documents are parsed and analyzed at runtime. The attachment mapping works so far for everything else, except when it comes to those stupid Word documents. Is there any way to suppress the attachment parsing at runtime and do the text extraction separately, for example into another field? The thing is, we need to index the documents' content and we need to return the entire file if requested.
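
To make the idea more concrete, here is a rough sketch of what I have in mind: do the extraction on the client side, fall back to a second extractor if the first one throws, and then index the text and the Base64-encoded file as two separate fields. Everything in it is an assumption for illustration only: the `TextExtractor` interface, the field names `content` and `file`, and the stand-in extractors in `main` (in real code they would wrap Tika and POI's WordExtractor). It also assumes the `file` field would not be mapped as an attachment, so nothing tries to parse it on the server.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

public class SeparateExtractionSketch {

    /** Hypothetical abstraction: turns raw file bytes into plain text. */
    interface TextExtractor {
        String extract(byte[] fileBytes) throws Exception;
    }

    /** Tries the primary extractor; if it throws, uses the fallback instead. */
    static String extractWithFallback(byte[] fileBytes,
                                      TextExtractor primary,
                                      TextExtractor fallback) throws Exception {
        try {
            return primary.extract(fileBytes);
        } catch (Exception e) {
            // e.g. the NegativeArraySizeException wrapped in a TikaException
            return fallback.extract(fileBytes);
        }
    }

    /** Builds the document source: extracted text in one field, the file as Base64 in another. */
    static Map<String, Object> buildSource(byte[] fileBytes, String text) {
        Map<String, Object> source = new LinkedHashMap<>();
        source.put("content", text); // hypothetical field name, analyzed for search
        source.put("file", Base64.getEncoder().encodeToString(fileBytes)); // hypothetical field name
        return source;
    }

    public static void main(String[] args) throws Exception {
        byte[] fileBytes = "pretend .doc bytes".getBytes(StandardCharsets.UTF_8);

        // Stand-ins for the real extractors: the first one simulates the failure above.
        TextExtractor failing = bytes -> { throw new NegativeArraySizeException(); };
        TextExtractor working = bytes -> "extracted text";

        String text = extractWithFallback(fileBytes, failing, working);
        System.out.println(buildSource(fileBytes, text));
    }
}
```

The resulting map would then be sent as the document source in a normal index request, so the search runs against `content` and the original file can be returned from `file`.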

Thanks for any help.


(system) #2