Mapper Attachment Plugin

bdevaca · December 28, 2016, 8:40pm

I've installed the mapper attachment plugin and have been loading different types of attachments and searching for words that appear in the attachment content. It seems to work fine for .txt, .doc, .rtf, and .pdf files. However, I can't get it to find content in .xlsx or .ppt files.

Using Sense, I can see that the .xlsx and .ppt files were loaded, but the query is not returning any results when I search for words that are in these attachments. Any ideas?

This is an example of my query:

GET /test/attachment/_search
{
  "fields": [],
  "query": {
    "match": {
      "file.content": "float"
    }
  },
  "highlight": {
    "fields": {
      "file.content": {}
    }
  }
}

dadoonet · December 28, 2016, 9:32pm

If you stored the content field, can you show its content so we know what has been extracted?

bdevaca · December 28, 2016, 9:46pm

I've put the file in an attachment because it exceeded the character limit for replies. It may be too small to be helpful. I can email the text if you'd like.

dadoonet · December 28, 2016, 9:59pm

No, I meant something like _search?fields=file.content

bdevaca · December 28, 2016, 11:02pm

When I search for the content like this:

GET /test/attachment/_search?fields=file.content
{
}

This is the result. Does this mean there was an issue loading the content?
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "attachment",
        "_id": "54881",
        "_score": 1
      }
    ]
  }
}

dadoonet · December 29, 2016, 4:40am

Can you confirm that you can see content from other documents?

If not, it probably means that you did not change the mapping to store the field.

If you can see the content for other docs, then it means that indeed something went wrong at index time.

LMK

bdevaca · December 29, 2016, 12:53pm

I can see the content for other types of attachments. How can I troubleshoot the failure to load the content for the .xlsx and .ppt files? I'm basically taking an InputStream of the attachment content, encoding it with Base64, and storing it as a string.

// Encode attachment content using basic encoder
byte[] bytes = IOUtils.toByteArray(inputStream);
String base64encodedString = Base64.getEncoder().encodeToString(bytes);
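
For anyone following along, here is a minimal, self-contained sketch of the same encoding step. It is not the poster's actual code: the class name AttachmentEncoder and its helper methods are hypothetical, and the JSON field name "file" is only assumed from the query on file.content above. The round-trip check is a quick way to rule out the encoding step itself as the cause.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper, for illustration only.
public class AttachmentEncoder {

    // Encode raw bytes with the basic Base64 encoder (no line breaks, standard alphabet).
    static String encode(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes);
    }

    // Build a minimal JSON body. The field name "file" is an assumption.
    // Basic Base64 output contains only A-Z, a-z, 0-9, '+', '/', '=' and needs no JSON escaping.
    static String toDocument(byte[] bytes) {
        return "{\"file\":\"" + encode(bytes) + "\"}";
    }

    // Decode the encoded string again and compare it with the original bytes.
    static boolean roundTrips(byte[] bytes) {
        return Arrays.equals(bytes, Base64.getDecoder().decode(encode(bytes)));
    }

    public static void main(String[] args) {
        byte[] sample = "float".getBytes(StandardCharsets.UTF_8);
        System.out.println(toDocument(sample));
        System.out.println("round trip ok: " + roundTrips(sample));
    }
}
```

If the round trip succeeds for the .xlsx bytes as well, the encoding is probably fine and the problem is more likely on the extraction side.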

bdevaca · December 29, 2016, 1:21pm

I set ignore errors to false, and now I see this error when loading .xlsx files.

{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "Failed to extract [-1] characters of text for [null] : Could not initialize class org.apache.poi.POIXMLProperties"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "Failed to extract [-1] characters of text for [null] : Could not initialize class org.apache.poi.POIXMLProperties",
    "caused_by": {
      "type": "no_class_def_found_error",
      "reason": "Could not initialize class org.apache.poi.POIXMLProperties"
    }
  },
  "status": 400
}
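
A note on reading this error: on the JVM, a NoClassDefFoundError with the message "Could not initialize class ..." usually means an earlier attempt to run that class's static initializer already failed, so the first failure carries the real cause. Below is a rough sketch of a standalone check that tries to load and initialize a class by name and reports the root cause. The class name ClassInitCheck is hypothetical, and running it against the same jars the plugin uses is an assumption; the class loading inside Elasticsearch may behave differently.

```java
// Hypothetical diagnostic, for illustration only.
public class ClassInitCheck {

    // Returns null if the class loads and initializes, otherwise a short description of the failure.
    static String check(String className) {
        try {
            Class.forName(className); // loads the class and runs its static initializers
            return null;
        } catch (Throwable t) {
            // Covers ClassNotFoundException, NoClassDefFoundError and ExceptionInInitializerError.
            Throwable root = t;
            while (root.getCause() != null) {
                root = root.getCause();
            }
            return t.getClass().getSimpleName() + " (root cause: " + root + ")";
        }
    }

    public static void main(String[] args) {
        // Defaults to the class named in the error above; pass another name as the first argument.
        String name = args.length > 0 ? args[0] : "org.apache.poi.POIXMLProperties";
        String problem = check(name);
        System.out.println(problem == null ? name + " initialized" : name + " failed: " + problem);
    }
}
```

A ClassNotFoundException here would point to a jar that is missing from the classpath, while an ExceptionInInitializerError would point to a dependency of that class failing during static initialization.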

dadoonet · December 29, 2016, 1:22pm

Great. Can you share your file?

Even privately at david at elastic dot co, and I'll take a look next week.

bdevaca · December 29, 2016, 1:28pm

Thanks! I just sent an email with my .xlsx attachment. I'll continue to troubleshoot on my end as well.

dadoonet · December 29, 2016, 2:13pm

Great!

Can you open an issue with that trace?
And link to this thread in the issue?

I think we are missing a library here. I hope we can fix it if we don't have jar hell issues.

Thanks a lot for investigating!

bdevaca · December 29, 2016, 2:15pm

Sure! I'm actually trying to mess with jars right now. I'll open up a new issue and link it.

Thanks!

dadoonet · December 29, 2016, 2:33pm

I meant open an issue on GitHub, not on discuss.

bdevaca · December 29, 2016, 2:34pm

Oh, ok. Sure!

bdevaca · December 29, 2016, 2:40pm

I created an issue on GitHub here:

dadoonet · December 29, 2016, 4:37pm

I meant in the elasticsearch repository, where the project and the plugin live.

So here:

bdevaca · December 29, 2016, 4:45pm

OK. I think I got it this time. I'm not sure how to link it to this issue, but here is the URL:
