Attachments plugin - has anyone been using this successfully?


(DKichler) #1

Hi there,

I recently began playing around with ES and found the attachments plugin
very interesting. I've set it up based on the brief description given in the
docs (http://www.elasticsearch.com/docs/elasticsearch/mapping/attachment/), but
could not get the results I expected. Either my understanding of
what the plugin does is inaccurate, or I've been using/mapping the plugin
incorrectly somehow.

As I understand it, the attachments plugin is designed to bootstrap the
content extraction capabilities provided by Tika.

Based on the small tutorial in the docs, the content is provided in the
request as base64 encoded data. With minor changes to the most basic
example, my mapping looks like:

{
  "properties" : {
    "file" : {
      "type" : "attachment",
      "fields" : {
        "file" : {"index" : "analyzed", "store" : "yes"},
        "date" : {"index" : "analyzed", "store" : "yes"}
      }
    }
  }
}

The json document I pass with an index request looks like:

{ "file" : "JVBERi0xLjQNJeLjz9MNCjQ4IDAgb2JqPDw...." }
(content is base64 encoded raw binary data read from a .pdf document)
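
For reference, a body like the one above can be built with something along the lines of this Python sketch. The function names and the default field name "file" are only illustrative (they are not part of the plugin), and the sketch only builds the JSON body; it does not send any request to ES.

```python
import base64
import json


def build_attachment_doc(raw_bytes: bytes, field: str = "file") -> str:
    """Return a JSON body with raw_bytes base64-encoded under `field`.

    `field` defaults to "file" to match the mapping above; the name is an
    assumption for illustration only.
    """
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return json.dumps({field: encoded})


def build_attachment_doc_from_path(path: str, field: str = "file") -> str:
    """Read a file in binary mode and build the JSON body from its bytes."""
    with open(path, "rb") as fh:
        return build_attachment_doc(fh.read(), field)
```

Calling build_attachment_doc_from_path with the path to a .pdf gives a string that can be used as the body of the index request.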

Now, based on my limited understanding of how the attachment plugin works, I
would expect it to pass the provided content through Tika's extraction
process, then index the extracted content based on the mapping provided.
However, when I retrieve the newly indexed document, the file field
contains only the raw base64 data that was provided with the index request
(above). Am I missing something? Is this the correct behaviour, or have I
implemented the plugin incorrectly?

Has anyone out there used this plugin extensively, or at all? Any tips on
what I may be doing wrong would be greatly appreciated.

Thanks,
Dave K

