RuntimeException while parsing doc file

Hi,

I am getting the following exception when I run the PUT from Kibana against Elasticsearch 6.6.0 with ingest-attachment-6.6.0:

{
  "error": {
    "root_cause": [
      {
        "type": "exception",
        "reason": "java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [resumeB64]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@2db961e1]; nested: ArrayIndexOutOfBoundsException;",
        "header": { "processor_type": "attachment" }
      }
    ],
    "type": "exception",
    "reason": "java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [resumeB64]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@2db961e1]; nested: ArrayIndexOutOfBoundsException;",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "ElasticsearchParseException[Error parsing document in field [resumeB64]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@2db961e1]; nested: ArrayIndexOutOfBoundsException;",
      "caused_by": {
        "type": "parse_exception",
        "reason": "Error parsing document in field [resumeB64]",
        "caused_by": {
          "type": "tika_exception",
          "reason": "Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@2db961e1",
          "caused_by": { "type": "array_index_out_of_bounds_exception", "reason": null }
        }
      }
    },
    "header": { "processor_type": "attachment" }
  },
  "status": 500
}
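
For readers trying to find the innermost cause in a response like the one above, here is a minimal Python sketch that walks the nested "caused_by" entries. The function name cause_chain and the trimmed sample response are illustrative assumptions, not part of any Elasticsearch client; the "reason" strings are shortened for readability.

```python
import json

def cause_chain(error_body):
    """Return exception types from the outermost error to the innermost cause."""
    chain = []
    node = error_body.get("error")
    # Each level may carry a "caused_by" object describing the next cause.
    while isinstance(node, dict):
        chain.append(node.get("type"))
        node = node.get("caused_by")
    return chain

# Trimmed copy of the error response (reasons shortened, structure preserved).
response = json.loads("""
{
  "error": {
    "type": "exception",
    "caused_by": {
      "type": "illegal_argument_exception",
      "caused_by": {
        "type": "parse_exception",
        "reason": "Error parsing document in field [resumeB64]",
        "caused_by": {
          "type": "tika_exception",
          "caused_by": {
            "type": "array_index_out_of_bounds_exception",
            "reason": null
          }
        }
      }
    }
  },
  "status": 500
}
""")

print(" -> ".join(cause_chain(response)))
```

The last entry in the chain is the exception thrown inside Tika, which is usually the most useful one to quote when asking for help.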

The following is the exception from the REST client:

{ "extendedStackTrace": "org.elasticsearch.ElasticsearchStatusException: Elasticsearch exception [type=exception, reason=java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [resumeB64]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@3731154b]; nested: ArrayIndexOutOfBoundsException;]\n\tat org.elasticsearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:177) ~[elasticsearch-6.5.0.jar!/:6.5.0]\n\tat org.elasticsearch.client.RestHighLevelClient.parseEntity(RestHighLevelClient.java:1793) ~[elasticsearch-rest-high-level-client-6.5.0.jar!/:6.5.0]\n\tat org.elasticsearch.client.RestHighLevelClient.parseResponseException(RestHighLevelClient.java:1769) ~[elasticsearch-rest-high-level-client-6.5.0.jar!/:6.5.0]\n\tat org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1606) ~[elasticsearch-rest-high-level-client-6.5.0.jar!/:6.5.0]\n\tat org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1563) ~[elasticsearch-rest-high-level-client-6.5.0.jar!/:6.5.0]\n\tat org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1546) ~[elasticsearch-rest-high-level-client-6.5.0.jar!/:6.5.0]\n\tat ....", "name": "org.elasticsearch.ElasticsearchStatusException" }

Can anyone please help me resolve this issue? Because of the limit on post length, I am unable to share the Base64 string here. I can send it by email if that helps in debugging the issue.

Share the xml file on gist.github.com.

Thanks for assisting. The PUT commands are available here

Can you share the original file as well? (Not the BASE64 one).

I have shared the HTML file that caused the SAXParserException in the same GitHub location. However, I cannot share the other file, as it contains confidential data.

I can reproduce the problem.
It is still failing at the Tika level, as it seems that the document is not "valid".

When I add manually <html> and </html> in your document, it is then possible to parse it.
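
If you wanted to apply that manual fix before indexing, a minimal Python sketch could look like the one below. It assumes the file is a text HTML fragment that can be read as UTF-8; the helper names wrap_html_fragment and to_attachment_field are hypothetical, and I have not verified that wrapping fixes every document Tika rejects.

```python
import base64

def wrap_html_fragment(text):
    """Wrap text in <html>...</html> unless an <html tag is already present."""
    if "<html" in text.lower():
        return text
    return "<html>" + text + "</html>"

def to_attachment_field(text):
    """Base64-encode the (possibly wrapped) document for the attachment field."""
    wrapped = wrap_html_fragment(text)
    return base64.b64encode(wrapped.encode("utf-8")).decode("ascii")

# Hypothetical fragment standing in for the real file content.
fragment = "<body><p>Example resume text</p></body>"
print(to_attachment_field(fragment))
```

The printed string is what would go into the Base64 field (resumeB64 in your pipeline) in place of the encoding of the raw file.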

I'm not sure there is a workaround. Maybe you can ask on the Tika mailing list?
