Mapper Attachment Plugin


(Brian DeVaca) #1

I've installed the mapper attachment plugin, loaded different types of attachments, and searched for words that appear in the attachment content. It seems to work fine for .txt, .doc, .rtf, and .pdf. However, I can't get it to find content in .xlsx or .ppt files.

Using Sense, I can see that the .xlsx and .ppt files were loaded, but the query is not returning any results when I search for words that are in these attachments. Any ideas?

This is an example of my query:

GET /test/attachment/_search
{
  "fields": [],
  "query": {
    "match": {
      "file.content": "float"
    }
  },
  "highlight": {
    "fields": {
      "file.content": {}
    }
  }
}


(David Pilato) #2

If you stored the content field, can you show its content so we know what has been extracted?


(Brian DeVaca) #3

I've placed the file in an attachment because it exceeded the reply character limit. It may be too small to be helpful. I can email the text if you'd like.


(David Pilato) #4

No, I meant something like _search?fields=file.content


(Brian DeVaca) #5

When I search for the content like this:
GET /test/attachment/_search?fields=file.content
{

}

This is the result. Does this mean there was an issue loading the content?
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "attachment",
        "_id": "54881",
        "_score": 1
      }
    ]
  }
}


(David Pilato) #6

Can you confirm that you can see content from other documents?

If not, it probably means that you did not change the mapping to store the field.

If you can see the content for other docs, then it means that indeed something went wrong at index time.

LMK
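
For readers following along, "changing the mapping to store the field" could look roughly like the sketch below. It is not taken from this thread: the index name `test`, the type `attachment`, and the field `file` are inferred from the requests shown above, the mapping body assumes the 2.x-era plugin syntax, and the class name `StoredContentMapping` and the URL passed on the command line are hypothetical.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Elasticsearch or the plugin.
class StoredContentMapping {

    // Assumed mapping: an "attachment" field named "file" whose extracted
    // "content" sub-field is stored, so it can be returned with ?fields=file.content.
    static final String MAPPING =
        "{\"mappings\":{\"attachment\":{\"properties\":{"
      + "\"file\":{\"type\":\"attachment\",\"fields\":{"
      + "\"content\":{\"type\":\"string\",\"store\":true}"
      + "}}}}}}";

    // Create the index with the mapping above and return the HTTP status code.
    static int createIndex(String indexUrl) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(indexUrl).openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(MAPPING.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(MAPPING);
        // Only talk to a cluster when an index URL is given,
        // e.g. http://localhost:9200/test (an assumed local node).
        if (args.length > 0) {
            System.out.println("HTTP " + createIndex(args[0]));
        }
    }
}
```

The mapping has to be in place before documents are indexed, so an existing index would need to be recreated and the attachments loaded again.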


(Brian DeVaca) #7

I can see the content for other types of attachments. How can I troubleshoot the failure to load the content for the .xlsx and .ppt files? I'm basically taking an InputStream of the attachment content, encoding it with Base64, and storing it as a string.

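// Needs java.util.Base64; IOUtils is presumably org.apache.commons.io.IOUtils (Apache Commons IO)
// Read the whole attachment stream, then encode it using the basic encoder
byte[] bytes = IOUtils.toByteArray(inputStream);
String base64encodedString = Base64.getEncoder().encodeToString(bytes);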
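
To rule out the encoding step as the cause, a self-contained sketch along these lines may help. It assumes the attachment field is named `file` (inferred from the `file.content` query earlier in the thread) and that the document body is simply that field holding the Base64 string. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper, not part of the plugin or of the original post.
class AttachmentDocument {

    // Base64-encode the raw file bytes with the basic (non-URL, non-MIME) encoder.
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Build the JSON body; the field name "file" is an assumption.
    // Basic Base64 output only uses A-Z, a-z, 0-9, '+', '/' and '=',
    // so the value needs no JSON escaping.
    static String toJson(String base64) {
        return "{\"file\":\"" + base64 + "\"}";
    }

    // Decode again to confirm the payload survives the round trip unchanged.
    static boolean roundTrips(byte[] raw) {
        return Arrays.equals(raw, Base64.getDecoder().decode(encode(raw)));
    }

    public static void main(String[] args) {
        byte[] sample = "float".getBytes(StandardCharsets.UTF_8);
        System.out.println(toJson(encode(sample))); // prints {"file":"ZmxvYXQ="}
        System.out.println(roundTrips(sample));     // prints true
    }
}
```

If the round trip holds for the bytes of an .xlsx file as well, the encoding is probably fine and the problem lies in extraction on the Elasticsearch side.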

(Brian DeVaca) #8

I set ignore errors to false, and I now see this error when loading .xlsx files.

{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "Failed to extract [-1] characters of text for [null] : Could not initialize class org.apache.poi.POIXMLProperties"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "Failed to extract [-1] characters of text for [null] : Could not initialize class org.apache.poi.POIXMLProperties",
    "caused_by": {
      "type": "no_class_def_found_error",
      "reason": "Could not initialize class org.apache.poi.POIXMLProperties"
    }
  },
  "status": 400
}
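
On the JVM, a NoClassDefFoundError reading "Could not initialize class" generally means the class was found but its static initialization failed earlier, often because something it depends on could not be loaded. One way to narrow this down outside Elasticsearch is a small probe like the sketch below, run with the same jars on the classpath as the plugin ships. The class name `ClassProbe` is hypothetical, and a standalone run is only a hint, since Elasticsearch loads plugins with its own class loader and security settings.

```java
// Hypothetical diagnostic, not part of Elasticsearch or the plugin.
class ClassProbe {

    // Try to load and initialize a class by name and report what happened.
    static String probe(String className) {
        try {
            Class.forName(className);
            return "loadable";
        } catch (ClassNotFoundException e) {
            // The class itself is not on the classpath.
            return "not found";
        } catch (ExceptionInInitializerError e) {
            // The class was found, but its static initializer threw.
            return "static initializer failed: " + e.getCause();
        } catch (NoClassDefFoundError e) {
            // A class it depends on is missing, or initialization failed earlier.
            return "dependency missing or earlier initialization failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "org.apache.poi.POIXMLProperties";
        System.out.println(name + ": " + probe(name));
    }
}
```

Run without arguments, it probes the class named in the error above; the first failure it reports usually names the exception or missing class that is the real cause.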


(David Pilato) #9

Great. Can you share your file?

Even privately at david at elastic dot co, and I'll take a look next week.


(Brian DeVaca) #10

Thanks! I just sent an email with my .xlsx attachment. I'll continue to troubleshoot on my end as well.


(David Pilato) #11

Great!

Can you open an issue with that trace?
And link to this thread in the issue?

I think we are missing a library here. I hope we can fix it if we don't have jar hell issues.

Thanks a lot for investigating!


(Brian DeVaca) #12

Sure! I'm actually trying to mess with jars right now. I'll open up a new issue and link it.

Thanks!


(David Pilato) #13

I meant open an issue on GitHub, not on Discuss :)


(Brian DeVaca) #14

Oh, ok. Sure!


(Brian DeVaca) #15

I created an issue on GitHub here:


(David Pilato) #16

I meant in the elasticsearch repository, where the project and the plugin live.

So here:


(Brian DeVaca) #17

OK. I think I got it this time. I'm not sure how to link it to this issue, but here is the URL:

