Mapper Attachment Plugin

I've installed the mapper attachment plugin, loaded different types of attachments, and run searches for words that appear in the attachment content. It works fine for .txt, .doc, .rtf, and .pdf files. However, I can't get it to find content in .xlsx or .ppt files.

Using Sense, I can see that the .xlsx and .ppt files were loaded, but the query is not returning any results when I search for words that are in these attachments. Any ideas?

This is an example of my query:

GET /test/attachment/_search
{
  "fields": [],
  "query": {
    "match": {
      "file.content": "float"
    }
  },
  "highlight": {
    "fields": {
      "file.content": {}
    }
  }
}

If you stored the content field, can you show its content so we know what has been extracted?

I've put the content in an attached file because it exceeded the character limit for replies. It may be too small to be helpful. I can email the text if you'd like.

No, I meant something like _search?fields=file.content

When I search for the content like this:

GET /test/attachment/_search?fields=file.content
{}

This is the result. Does this mean there was an issue loading the content?
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "attachment",
        "_id": "54881",
        "_score": 1
      }
    ]
  }
}

Can you confirm that you can see content from other documents?

If not, it probably means that you did not change the mapping to store the field.

If you can see the content for other docs, then it means that indeed something went wrong at index time.

Let me know.

I can see the content for other types of attachments. How can I troubleshoot the failure to load the content for the .xlsx and .ppt files? I'm basically reading the attachment content from an InputStream, encoding it with Base64, and storing it as a string.

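import java.util.Base64;
import org.apache.commons.io.IOUtils; // assumption: IOUtils is the Apache Commons IO class

// Read the whole stream, then encode it with the basic (RFC 4648) Base64 encoder
byte[] bytes = IOUtils.toByteArray(inputStream);
String base64encodedString = Base64.getEncoder().encodeToString(bytes);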
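
For anyone who wants to try the encoding step on its own, here is a minimal sketch of it without the IOUtils dependency. It is only an illustration under a few assumptions: the class and method names (AttachmentEncoder, encode, toDocument) are made up for this example, InputStream.readAllBytes (Java 9+) stands in for IOUtils.toByteArray, and the field name "file" is inferred from the "file.content" path in the query above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of the plugin or of the original code.
class AttachmentEncoder {

    // Reads the whole stream and returns its bytes as a Base64 string.
    static String encode(InputStream in) throws IOException {
        byte[] bytes = in.readAllBytes(); // Java 9+; stands in for IOUtils.toByteArray
        return Base64.getEncoder().encodeToString(bytes);
    }

    // Wraps the encoded content in a JSON body. The field name "file" is an
    // assumption taken from the "file.content" path used in the query.
    static String toDocument(String base64) {
        // The basic Base64 alphabet (A-Z, a-z, 0-9, +, /, =) needs no JSON escaping.
        return "{\"file\":\"" + base64 + "\"}";
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "float".getBytes(StandardCharsets.UTF_8);
        String encoded = encode(new ByteArrayInputStream(sample));
        System.out.println(toDocument(encoded));

        // Round-trip check: decoding should give back the original text.
        byte[] decoded = Base64.getDecoder().decode(encoded);
        System.out.println(new String(decoded, StandardCharsets.UTF_8));
    }
}
```

The round trip in main is a quick way to rule out the encoding step: if decoding returns the original bytes, the problem is more likely in the extraction on the Elasticsearch side than in the client code.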

I set the ignore_errors setting to false, and now I see this error when loading .xlsx files:

{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "Failed to extract [-1] characters of text for [null] : Could not initialize class org.apache.poi.POIXMLProperties"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "Failed to extract [-1] characters of text for [null] : Could not initialize class org.apache.poi.POIXMLProperties",
    "caused_by": {
      "type": "no_class_def_found_error",
      "reason": "Could not initialize class org.apache.poi.POIXMLProperties"
    }
  },
  "status": 400
}
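
A note on reading this kind of trace: in Java, a NoClassDefFoundError whose message starts with "Could not initialize class" generally means the class was found, but its static initializer failed on an earlier attempt. The underlying cause is reported on that first failure, as an ExceptionInInitializerError, so it may be worth looking earlier in the node log for it. The sketch below reproduces the pattern in isolation. InitFailureDemo and Broken are hypothetical names for illustration only and have nothing to do with POI or the plugin; the simulated failure is just a stand-in for whatever actually went wrong.

```java
// Hypothetical demo of the JVM behavior behind "Could not initialize class".
class InitFailureDemo {

    // Stand-in for a class whose static initializer fails, for example
    // because something it depends on is missing at runtime.
    static class Broken {
        // Not a compile-time constant, so reading it triggers initialization.
        static final int VALUE = compute();

        static int compute() {
            throw new IllegalStateException("simulated missing dependency");
        }
    }

    // Touches Broken and returns the Throwable that this raises, or null.
    static Throwable touch() {
        try {
            int ignored = Broken.VALUE;
            return null;
        } catch (Throwable t) {
            return t;
        }
    }

    public static void main(String[] args) {
        Throwable first = touch();   // first use: initializer runs and fails
        Throwable second = touch();  // later use: class is marked as unusable
        System.out.println("first:  " + first);
        System.out.println("        cause: " + first.getCause());
        System.out.println("second: " + second);
    }
}
```

The first access carries the real cause, while every later access only reports that the class could not be initialized, which is the shape of the error above.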

Great. Can you share your file?

Even privately at david at elastic dot co, and I'll take a look next week.

Thanks! I just sent an email with my .xlsx attachment. I'll continue to troubleshoot on my end as well.

Great!

Can you open an issue with that trace?
And link to this thread in the issue?

I think we are missing a library here. I hope we can fix it if we don't have jar hell issues.

Thanks a lot for investigating!

Sure! I'm actually trying to mess with jars right now. I'll open up a new issue and link it.

Thanks!

I meant open an issue on GitHub, not on Discuss :)

Oh, ok. Sure!

I created an issue on GitHub here:

I meant in the elasticsearch repository, where the project and the plugin live.

So here:

Ok. I think I got it this time. I'm not sure how to link it to this issue but here is the url:
