Attachment plugin: flexibility?

Hi !

First, many thanks for the great work done on the attachment plugin
and on elasticsearch in general.

I have a few questions about the attachment plugin:

  • I tried to configure the attachment mapping by adding a "size" long
    field, but it always gets dropped. I had a look at the code and it
    seems that the attachment type takes into account existing
    configuration of the standard fields, but drops the others. Why?

curl -X PUT "${host}/test/attachment/_mapping" -d '{
  "attachment" : {
    "properties" : {
      "file" : {
        "type" : "attachment",
        "fields" : {
          "size" : { "type" : "long", "store" : "yes" },
          "title" : { "store" : "yes" },
          "file" : { "term_vector" : "with_positions_offsets", "store" : "yes" }
        }
      }
    }
  }
}'

  • When I index simple PDF files and try to fetch back the
    content_type, I don't get the response:

gist:1280860 · GitHub

  • I saw in the source that we could hint the document name using
    _name, but it doesn't seem to be used as an alternative title for the
    document if Tika doesn't manage to extract one from the metadata.
    Wouldn't that make sense? What is it used for then?

Thanks again !

Jérémie

On Wed, Oct 12, 2011 at 12:33 PM, jeremie.bordier@gmail.com
ahfeel@gmail.com wrote:

Also, another quick question:

  • If my understanding is right, the original message will be stored in
    _source with the "content" : base64. Will it really store the base64
    encoded file without decoding it first ? This would be a HUGE storing
    overhead compared to the bare original binary. (I'm okay for storing
    the binary in ES as long as it's not stored in base64).
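
For a rough sense of that overhead: base64 maps every 3 raw bytes to 4 ASCII characters, so the encoded form is about a third larger than the binary. A minimal shell sketch that checks this on 3000 zero bytes (the variable names are illustrative, and it assumes the common `head`, `base64`, `tr` and `wc` utilities are available):

```shell
# Number of raw bytes to encode (illustrative value, a multiple of 3).
raw_len=3000

# Generate raw_len zero bytes, base64-encode them, and strip the line
# wraps so that only the encoded characters are counted.
encoded_len=$(head -c "$raw_len" /dev/zero | base64 | tr -d '\n' | wc -c | tr -d ' ')

# Integer percentage by which the encoded form exceeds the raw size.
overhead_pct=$(( (encoded_len - raw_len) * 100 / raw_len ))

echo "raw: $raw_len bytes, base64: $encoded_len bytes, overhead: ${overhead_pct}%"
```

This only measures the encoding itself, before any compression of the stored _source.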

Thanks !

--
Jérémie 'ahFeel' BORDIER

Hi,

My understanding is that, as of now, it only allows storing base64-encoded
attachments in _source. That is what the incoming JSON document looks like,
and thus that should be what the client gets back. Decoding it and re-encoding
it internally would mean that the search operation would take longer (but maybe
there are also other reasons why it is stored in base64). At the end of
the day you need to apply some encoding anyway, because the data is
transmitted and replicated inside the cluster and between the cluster and the
client.

I am not sure that is really a problem from the disk space perspective; I do not
have enough attachment content to be anywhere near any index size
threshold. Also note, you can compress _source content:

On Wed, Oct 12, 2011 at 6:37 PM, Jérémie BORDIER ahfeel@gmail.com wrote:

  • I tried to configure the attachment mapping by adding a "size" long
    field, but it always gets dropped. I had a look at the code and it
    seems that the attachment type takes into account existing
    configuration of the standard fields, but drops the others. Why?

  • When I index simple PDF files and try to fetch back the
    content_type, I don't get the response:

gist:1280860 · GitHub

I think these sound like candidates for a ticket. I remember looking at the
code some time back and noticing something similar, but it was not a problem
for me at that time (that is probably why this plugin is experimental).

  • I saw in the source that we could hint the document name using
    _name, but it doesn't seem to be used as an alternative title for the
    document if Tika doesn't manage to extract one from the metadata.
    Wouldn't that make sense? What is it used for then?

What if the document really does not have any title (as opposed to Tika failing
to extract it)? How could you then find that out, in case it is important? I
am not saying that this was the original reasoning behind this logic, though.
In any case, you can always pull both fields [title, name] and do this magic on
the client side.

HTH,

Regards,
Lukas

On Wed, Oct 12, 2011 at 12:33 PM, jeremie.bordier@gmail.com <
ahfeel@gmail.com> wrote:

  • I tried to configure the attachment mapping by adding a "size" long
    field, but it always gets dropped. I had a look at the code and it
    seems that the attachment type takes into account existing
    configuration of the standard fields, but drops the others. Why?

Because it only supports specific "extra" fields associated with the
attachment, and indexing the size is not one of them. You can either add it
as a field yourself to the JSON doc, or we can enhance the attachment
mapper to also index it.
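
A minimal sketch of the first option, adding the size to the JSON document yourself. The helper name `build_doc`, the file name and the document id are hypothetical; the idea is only that "size" sits next to the attachment field "file" as an ordinary field, so it does not depend on the attachment mapper at all:

```shell
# Build an index request body with "size" as an ordinary top-level field
# next to the attachment field "file". All names here are illustrative.
build_doc() {
  path="$1"
  size=$(wc -c < "$path" | tr -d ' ')        # file size in bytes
  content=$(base64 < "$path" | tr -d '\n')   # base64 payload on a single line
  printf '{ "size" : %s, "file" : "%s" }' "$size" "$content"
}

# Hypothetical usage against a running node (not executed here):
#   build_doc report.pdf | curl -X PUT "${host}/test/attachment/1" -d @-
```

Because "size" is a plain field, it should be indexed and returned like any other field in the document.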

  • When I index simple PDF files and try to fetch back the
    content_type, I don't get the response:

gist:1280860 · GitHub

You need to explicitly configure the content_type mapping and set "store" to
"yes".

  • I saw in the source that we could hint the document name using
    _name, but it doesn't seem to be used as an alternative title for the
    document if Tika doesn't manage to extract one from the metadata.
    Wouldn't that make sense? What is it used for then?

Just to enhance Tika's ability to guess the content type.
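
A sketch of how such a hint might be passed. The "_name" and "content" keys are the ones mentioned in this thread; the exact object layout is an assumption that should be checked against the plugin README, and the helper name `build_named_doc` is hypothetical:

```shell
# Build a document whose attachment field carries the file name as a
# "_name" hint next to the base64 "content". The layout is an assumption.
# Note: a file name containing double quotes would need JSON escaping.
build_named_doc() {
  path="$1"
  name=$(basename "$path")                   # file name used as the hint
  content=$(base64 < "$path" | tr -d '\n')   # base64 payload on a single line
  printf '{ "file" : { "_name" : "%s", "content" : "%s" } }' "$name" "$content"
}

# Hypothetical usage against a running node (not executed here):
#   build_named_doc report.pdf | curl -X PUT "${host}/test/attachment/1" -d @-
```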

Hi Shay and Lukas,

Thanks for the answers :) I had a quick talk with Lukas on Twitter, so
I'll just post back on your last comments, Shay.

  • When I index simple PDF files and try to fetch back the
    content_type, I don't get the response:

gist:1280860 · GitHub

You need to explicitly configure the content_type mapping and set "store" to
"yes".

Isn't "file.content_type" supposed to fetch it from _source implicitly ?

  • I saw in the source that we could hint the document name using
    _name, but it doesn't seem to be used as an alternative title for the
    document if Tika doesn't manage to extract one from the metadata.
    Wouldn't that make sense? What is it used for then?

Just to enhance Tika's ability to guess the content type.

Ahh.. okay :)

Thanks,
Jérémie

On Wed, Oct 12, 2011 at 6:37 PM, Jérémie BORDIER ahfeel@gmail.com wrote:

Also, another quick question:

  • If my understanding is right, the original message will be stored in
    _source with the "content" : base64. Will it really store the base64
    encoded file without decoding it first ? This would be a HUGE storing
    overhead compared to the bare original binary. (I'm okay for storing
    the binary in ES as long as it's not stored in base64).

Yes, elasticsearch does not "touch" the _source and stores it "as is". You
can have the _source compressed though, or disable storing the _source
completely and explicitly store, in the mapping, the fields you want.
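
A sketch of what those two options might look like as mapping bodies. It assumes the `"_source" : { "compress" : true }` and `"_source" : { "enabled" : false }` mapping settings of elasticsearch releases from that period (worth verifying against the documentation for your version); the variable names and the choice of "title" as the stored field are illustrative:

```shell
# Option 1 (assumed setting): keep _source but have it compressed.
compressed_source_mapping='{
  "attachment" : {
    "_source" : { "compress" : true }
  }
}'

# Option 2 (assumed setting): do not store _source at all, and store
# only the fields that must be retrievable later.
no_source_mapping='{
  "attachment" : {
    "_source" : { "enabled" : false },
    "properties" : {
      "file" : {
        "type" : "attachment",
        "fields" : {
          "title" : { "store" : "yes" }
        }
      }
    }
  }
}'

# Hypothetical usage against a running node (not executed here):
#   curl -X PUT "${host}/test/attachment/_mapping" -d "$compressed_source_mapping"
```

With the second option the original base64 payload cannot be fetched back, so it only fits when the file itself is kept somewhere else.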

On Wed, Oct 12, 2011 at 11:44 PM, Jérémie BORDIER <jeremie.bordier@gmail.com> wrote:

  • When I index simple PDF files and try to fetch back the
    content_type, I don't get the response:

gist:1280860 · GitHub

You need to explicitly configure the content_type mapping and set "store" to
"yes".

Isn't "file.content_type" supposed to fetch it from _source implicitly ?

No, _source is the json you provided, as is, without any changes. Obviously,
it does not contain the content_type, that's why you need to explicitly store
it in addition to the _source.
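
A sketch of that setup, assuming "content_type" is one of the sub-fields the attachment mapper accepts under "fields" and that a search request can ask for stored fields through a "fields" list. The variable names and the match_all query are illustrative:

```shell
# Mapping that explicitly stores the content type extracted by Tika.
content_type_mapping='{
  "attachment" : {
    "properties" : {
      "file" : {
        "type" : "attachment",
        "fields" : {
          "content_type" : { "store" : "yes" }
        }
      }
    }
  }
}'

# Search body that asks for the stored field instead of the _source.
search_body='{
  "fields" : ["file.content_type"],
  "query" : { "match_all" : {} }
}'

# Hypothetical usage against a running node (not executed here):
#   curl -X PUT "${host}/test/attachment/_mapping" -d "$content_type_mapping"
#   curl -X GET "${host}/test/attachment/_search" -d "$search_body"
```

Documents indexed before the mapping change would presumably need to be reindexed before the stored field shows up.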

On Wed, Oct 12, 2011 at 11:46 PM, Shay Banon kimchy@gmail.com wrote:

You need to explicitly configure the content_type mapping and set "store" to
"yes".

Isn't "file.content_type" supposed to fetch it from _source implicitly ?

No, _source is the json you provided, as is, without any changes. Obviously,
it does not contain the content_type, that's why you need to explicitly store
it in addition to the _source.

Ah sorry, I realized I forgot to show the PUT request I made. Then,
last question: if I had a content_type in my source json, would the
"file.content_type" => "_source.file.content_type" lookup be done
automatically? It doesn't seem so, but I thought I had seen
something like that in the documentation.

Jérémie

On Wed, Oct 12, 2011 at 11:49 PM, Jérémie BORDIER <jeremie.bordier@gmail.com> wrote:

Ah sorry, I realized I forgot to show the PUT request I made. Then,
last question: if I had a content_type in my source json, would the
"file.content_type" => "_source.file.content_type" lookup be done
automatically? It doesn't seem so, but I thought I had seen
something like that in the documentation.

If you have any other field in the json, then it will be indexed and you can
search on it, as well as get it back from the _source.

Jérémie