Attachment plugin: flexibility?

jeremie_bordier · October 12, 2011, 10:33am

Hi !

First, many thanks for the great work done on the attachment plugin
and on elasticsearch in general.

I have a few questions about the attachment plugin:

I tried to configure the attachment mapping by adding a "size" field of
type long, but it always gets dropped. I had a look at the code, and it
seems that the attachment type takes into account the existing
configuration of the standard fields but drops any others. Why?

curl -X PUT "${host}/test/attachment/_mapping" -d '{
  "attachment" : {
    "properties" : {
      "file" : {
        "type" : "attachment",
        "fields" : {
          "size" : { "type" : "long", "store" : "yes" },
          "title" : { "store" : "yes" },
          "file" : { "term_vector" : "with_positions_offsets", "store" : "yes" }
        }
      }
    }
  }
}'

When I index simple PDF files and try to fetch back the
content_type, I don't get the response:

https://gist.github.com/anonymous/1280860

## REQUEST

{
  "fields": [
    "file.title",
    "file.content_type"
  ],
  "query": {
    "query_string": {
      "query": "amplifier"

(This file has been truncated; the full request is in the gist linked above.)

I saw in the source that we can hint the document name using
_name, but it doesn't seem to be used as a fallback title for the
document when Tika doesn't manage to extract one from the metadata.
Wouldn't that make sense? What is it used for, then?

Thanks again !

Jérémie

jeremie_bordier · October 12, 2011, 4:37pm

Also, another quick question:

If my understanding is right, the original message is stored in
_source with "content" : base64. Will it really store the base64-encoded
file without decoding it first? That would be a huge storage
overhead compared to the raw binary. (I'm fine with storing
the binary in ES as long as it isn't stored as base64.)

Thanks !

--
Jérémie 'ahFeel' BORDIER

Lukas_Vlcek1 · October 12, 2011, 8:14pm

Hi,

my understanding is that, as of now, attachments can only be stored
base64-encoded in _source. That is what the incoming JSON document looks like,
and thus that should be what the client gets back. Decoding it and re-encoding
it internally would mean that the search operation would take longer (but maybe
there are also other reasons why it is stored in base64). At the end of
the day you need to apply some encoding anyway, because the data is
transmitted and replicated inside the cluster and between the cluster and the
client.

I'm not sure whether that is really a problem from the disk-space perspective;
I do not have enough attachment content to get anywhere near any index size
threshold. Also note that you can compress the _source content.

On Wed, Oct 12, 2011 at 6:37 PM, Jérémie BORDIER ahfeel@gmail.com wrote:

I tried to configure the attachment mapping by adding a "size" field of
type long, but it always gets dropped. I had a look at the code, and it
seems that the attachment type takes into account the existing
configuration of the standard fields but drops any others. Why?

When I index simple PDF files and try to fetch back the
content_type, I don't get the response:

gist:1280860 · GitHub

I think these sound like candidates for a ticket? I remember looking at the
code some time ago and noticing something similar, but it was not a problem
for me at the time (that is probably why this plugin is experimental).

I saw in the source that we can hint the document name using
_name, but it doesn't seem to be used as a fallback title for the
document when Tika doesn't manage to extract one from the metadata.
Wouldn't that make sense? What is it used for, then?

What if the document really does not have any title (as opposed to Tika
failing to extract it)? How would you then learn about that, in case it is
important? I am not saying that this was the original reasoning behind this
logic, though. In any case, you can always pull both fields [title, name] and
do this magic on the client side.
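
The client-side fallback could look roughly like the Python sketch below. It is only an illustration: the helper name display_title, the field names "file.title" and "file.name", and the shape of the search hit (a "fields" dictionary) are assumptions, not something confirmed in this thread.

```python
from typing import Any, Optional


def first_value(value: Any) -> Any:
    """Return the first element if the field value is a list, else the value itself."""
    if isinstance(value, list):
        return value[0] if value else None
    return value


def display_title(hit: dict,
                  title_field: str = "file.title",
                  name_field: str = "file.name") -> Optional[str]:
    """Prefer the extracted title; fall back to the name when the title is missing or blank."""
    fields = hit.get("fields") or {}
    for key in (title_field, name_field):
        value = first_value(fields.get(key))
        if isinstance(value, str) and value.strip():
            return value.strip()
    return None  # neither field is usable; the caller decides what to show


# Hypothetical hit in which no title was extracted.
hit = {"fields": {"file.name": "amplifier-datasheet.pdf"}}
print(display_title(hit))
```

Returning None when both fields are empty keeps the "document really has no title" case visible to the caller instead of hiding it behind a default.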

HTH,

Regards,
Lukas

kimchy · October 12, 2011, 9:40pm

On Wed, Oct 12, 2011 at 12:33 PM, jeremie.bordier@gmail.com <
ahfeel@gmail.com> wrote:

I tried to configure the attachment mapping by adding a "size" field of
type long, but it always gets dropped. I had a look at the code, and it
seems that the attachment type takes into account the existing
configuration of the standard fields but drops any others. Why?

Because it only supports specific "extra" fields associated with the
attachment, and the size is not one of them. You can either add it
as a field yourself in the JSON doc, or we can enhance the attachment
mapper to also index it.
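
A minimal sketch of the first option, written in Python for illustration: the client computes the size and sends it as an ordinary field next to the attachment. The names "file", "_name" and "content" follow this thread; the top-level field name "size" and the helper build_document are assumptions.

```python
import base64
import json


def build_document(raw: bytes, name: str) -> dict:
    """Build an index request body that carries the file size as its own field."""
    return {
        "file": {
            # Per this thread, _name is only a hint for content-type detection.
            "_name": name,
            # The attachment itself travels base64-encoded inside the JSON.
            "content": base64.b64encode(raw).decode("ascii"),
        },
        # Hypothetical field name: a plain sibling field outside the
        # attachment type, so the attachment mapper never sees it.
        "size": len(raw),
    }


raw = b"%PDF-1.4 example bytes"
doc = build_document(raw, "example.pdf")
print(json.dumps(doc, indent=2))
```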

When I index simple PDF files and try to fetch back the
content_type, I don't get the response:

gist:1280860 · GitHub

You need to explicitly configure the content_type mapping and set "store" to
"yes".

I saw in the source that we can hint the document name using
_name, but it doesn't seem to be used as a fallback title for the
document when Tika doesn't manage to extract one from the metadata.
Wouldn't that make sense? What is it used for, then?

It is only used to improve Tika's ability to guess the content type.

Jeremie_BORDIER1 · October 12, 2011, 9:44pm

Hi Shay and Lukas,

Thanks for the answers. I had a quick chat with Lukas on Twitter, so
I'll just reply to your last comments, Shay.

When I index simple PDF files and try to fetch back the
content_type, I don't get the response:

gist:1280860 · GitHub

You need to explicitly configure the content_type mapping and set "store" to
"yes".

Isn't "file.content_type" supposed to fetch it from _source implicitly ?

I saw in the source that we can hint the document name using
_name, but it doesn't seem to be used as a fallback title for the
document when Tika doesn't manage to extract one from the metadata.
Wouldn't that make sense? What is it used for, then?

It is only used to improve Tika's ability to guess the content type.

Ahh.. okay

Thanks,
Jérémie

kimchy · October 12, 2011, 9:44pm

On Wed, Oct 12, 2011 at 6:37 PM, Jérémie BORDIER ahfeel@gmail.com wrote:

If my understanding is right, the original message is stored in
_source with "content" : base64. Will it really store the base64-encoded
file without decoding it first? That would be a huge storage
overhead compared to the raw binary. (I'm fine with storing
the binary in ES as long as it isn't stored as base64.)

Yes, elasticsearch does not "touch" the _source and stores it "as is". You
can have the _source compressed, though, or disable it completely (i.e. stop
storing the _source and instead explicitly store, in the mapping, what you want).
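
To put a number on the overhead being discussed: base64 emits 4 output characters for every 3 input bytes, so the encoded payload is about one third larger than the raw file. The short Python sketch below measures that on random bytes. The zlib step is only a stand-in to show that compression can win part of it back; which codec the server actually uses is not stated in this thread.

```python
import base64
import os
import zlib

raw = os.urandom(30_000)         # stand-in for a binary attachment
encoded = base64.b64encode(raw)  # what ends up inside the JSON _source

overhead = len(encoded) / len(raw) - 1
print(f"raw: {len(raw)} bytes, base64: {len(encoded)} bytes (+{overhead:.0%})")

# Assumption: zlib is used purely for illustration, not as the server's codec.
compressed = zlib.compress(encoded)
print(f"base64 after zlib: {len(compressed)} bytes")
```

Random bytes are the worst case for compression; real documents may shrink further, but that depends on the file format.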

kimchy · October 12, 2011, 9:46pm

On Wed, Oct 12, 2011 at 11:44 PM, Jérémie BORDIER <jeremie.bordier@gmail.com> wrote:

When I index simple PDF files and try to fetch back the
content_type, I don't get the response:

gist:1280860 · GitHub

You need to explicitly configure the content_type mapping and set "store" to
"yes".

Isn't "file.content_type" supposed to fetch it from _source implicitly ?

No, _source is the JSON you provided, as is, without any changes. Obviously,
it does not contain the content_type; that's why you need to explicitly store
it in addition to the _source.
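
As a sketch of that mapping change, the Python snippet below builds the body for the same PUT request used earlier in this thread, with content_type marked as stored. The index, type and field names are taken from the original request; treat the exact set of other fields as an assumption about what you want stored.

```python
import json

# Assumption: same index/type/field names as the mapping request in this thread.
mapping = {
    "attachment": {
        "properties": {
            "file": {
                "type": "attachment",
                "fields": {
                    "title": {"store": "yes"},
                    # Stored explicitly so it can be returned via "fields";
                    # it is extracted at index time and is not part of _source.
                    "content_type": {"store": "yes"},
                    "file": {
                        "term_vector": "with_positions_offsets",
                        "store": "yes",
                    },
                },
            }
        }
    }
}

body = json.dumps(mapping, indent=2)
print(body)  # body to send with: PUT ${host}/test/attachment/_mapping
```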

Jeremie_BORDIER1 · October 12, 2011, 9:49pm

On Wed, Oct 12, 2011 at 11:46 PM, Shay Banon kimchy@gmail.com wrote:

Isn't "file.content_type" supposed to fetch it from _source implicitly ?

No, _source is the JSON you provided, as is, without any changes. Obviously,
it does not contain the content_type; that's why you need to explicitly store
it in addition to the _source.

Ah sorry, I realize I forgot to show the PUT request I made. One
last question, then: if I had a content_type in my source JSON, would the
"file.content_type" => "_source.file.content_type" lookup be done
automatically? It doesn't seem so, but I thought I had seen
something like that in the documentation.

Jérémie

kimchy · October 14, 2011, 11:59am

On Wed, Oct 12, 2011 at 11:49 PM, Jérémie BORDIER <jeremie.bordier@gmail.com> wrote:

Ah sorry, I realize I forgot to show the PUT request I made. One
last question, then: if I had a content_type in my source JSON, would the
"file.content_type" => "_source.file.content_type" lookup be done
automatically? It doesn't seem so, but I thought I had seen
something like that in the documentation.

If you have any other field in the JSON, it will be indexed and you can
search on it, as well as get it back from the _source.
