I've installed the mapper attachment plugin and have been loading different types of attachments, then searching for words that appear in the attachment content. This works fine for .txt, .doc, .rtf, and .pdf files. However, I can't get it to find content in .xlsx or .ppt files.
Using Sense, I can see that the .xlsx and .ppt files were loaded, but the query is not returning any results when I search for words that are in these attachments. Any ideas?
When I search for content like this:
GET /test/attachment/_search?fields=file.content
{
}
This is the result. Does this mean there was an issue loading the content?
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "attachment",
        "_id": "54881",
        "_score": 1
      }
    ]
  }
}
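A minimal Python sketch for spotting hits like the one above, which come back without the requested field. It works on the parsed JSON response only and does not talk to Elasticsearch. The function name hits_missing_field is made up for illustration, and it assumes the response has the shape shown above, where a hit with no returned content has no "fields" key at all.

```python
def hits_missing_field(response, field="file.content"):
    """Return the _id of every hit that has no value for `field`."""
    missing = []
    for hit in response.get("hits", {}).get("hits", []):
        # With ?fields=..., returned values appear under hit["fields"].
        if field not in hit.get("fields", {}):
            missing.append(hit["_id"])
    return missing


# The response from the post, trimmed to the parts the function reads.
response = {
    "hits": {
        "total": 1,
        "hits": [
            {"_index": "test", "_type": "attachment", "_id": "54881", "_score": 1}
        ],
    }
}

print(hits_missing_field(response))  # ['54881']
```

An absent field only shows that nothing was returned for that hit; it does not say why.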
I can see the content for other types of attachments. How can I troubleshoot the failure to load the content for the .xlsx and .ppt files? I'm basically taking an InputStream of the attachment content, encoding it with base64, and storing it as a string.
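To rule out the encoding step itself, the base64 round trip can be checked in isolation. This is a rough Python sketch, not the code I'm actually running: encode_stream and the sample bytes are made up for illustration, and the "file" field name is assumed from the file.content path in the query above.

```python
import base64
import io


def encode_stream(stream):
    """Read a binary stream fully and return its bytes as a base64 ASCII string."""
    return base64.b64encode(stream.read()).decode("ascii")


# Stand-in for a real attachment stream (hypothetical content).
original = b"hello"
encoded = encode_stream(io.BytesIO(original))

# Document body as it would be sent for indexing ("file" is an assumed field name).
doc = {"file": encoded}

# Decoding must give back exactly the original bytes.
assert base64.b64decode(doc["file"]) == original
print(doc)  # {'file': 'aGVsbG8='}
```

If the round trip succeeds for an .xlsx file's bytes too, the encoding is probably not the issue and the problem lies in the extraction on the server side.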
I set ignore_errors to false, and I now see this error when loading .xlsx files:
{"error":
  {"root_cause":[
    {"type":"mapper_parsing_exception","reason":"Failed to extract [-1] characters of text for [null] : Could not initialize class org.apache.poi.POIXMLProperties"}],
  "type":"mapper_parsing_exception","reason":"Failed to extract [-1] characters of text for [null] : Could not initialize class org.apache.poi.POIXMLProperties",
  "caused_by":{"type":"no_class_def_found_error","reason":"Could not initialize class org.apache.poi.POIXMLProperties"}},
"status":400}