First, many thanks for the great work done on the attachment plugin
and on elasticsearch in general.
I have a few questions about the attachment plugin:
I tried to configure the attachment mapping by adding a "size" long
field, but it always gets dropped. I had a look at the code and it
seems that the attachment type takes into account existing
configuration of the standard fields but drops the others. Why?
When I index simple PDF files and try to fetch back the
content_type, I don't get it in the response:
I saw in the source that we can hint the document name using
_name, but it doesn't seem to be used as an alternative title for the
document if Tika doesn't manage to extract one from the metadata.
Wouldn't that make sense? What is it used for, then?
Thanks again !
Jérémie
Also, another quick question:
If my understanding is right, the original message will be stored in
_source with the "content" : base64. Will it really store the
base64-encoded file without decoding it first? This would be a HUGE
storage overhead compared to the bare original binary. (I'm okay with
storing the binary in ES as long as it's not stored in base64.)
My understanding is that, as of now, it only allows storing base64-encoded
attachments in _source. That is what the incoming JSON document looks like,
and thus that should be what the client gets back. Decoding it and
re-encoding it internally would mean that the search operation would take
longer (but maybe there are also other reasons why it is stored in base64).
At the end of the day you need to apply some encoding anyway, because the
data is transmitted and replicated inside the cluster and between the
cluster and the client.
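For a sense of scale: base64 maps every 3 bytes of input to 4 ASCII
characters, so the encoded text is roughly one third larger than the
original binary. A minimal Python sketch of that arithmetic follows; the
300 kB random payload is only a stand-in for a real attachment, and the
helper name is made up for this example.

```python
import base64
import os


def base64_overhead(raw: bytes) -> float:
    """Return len(base64 text) / len(raw bytes)."""
    encoded = base64.b64encode(raw)
    return len(encoded) / len(raw)


# Stand-in for a binary attachment (assumption: any bytes behave the same).
raw = os.urandom(300_000)
ratio = base64_overhead(raw)
print(f"base64 is {ratio:.3f}x the size of the raw bytes")
```

This only measures the encoding itself; whatever compression is applied to
_source afterwards will change the on-disk numbers.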
I'm not sure that is really a problem from the disk space perspective; I do
not have enough attachment content to come anywhere near any index size
threshold. Also note, you can compress _source content:
On Wed, Oct 12, 2011 at 6:37 PM, Jérémie BORDIER ahfeel@gmail.com wrote:
I tried to configure the attachment mapping by adding a "size" long
field, but it always gets dropped. I had a look at the code and it
seems that the attachment type takes into account existing
configuration of the standard fields but drops the others. Why?
I think this sounds like a candidate for a ticket. I remember looking at the
code some time ago and noticing something similar, but it was not a problem
for me at that time (that is probably why this plugin is experimental).
I saw in the source that we can hint the document name using
_name, but it doesn't seem to be used as an alternative title for the
document if Tika doesn't manage to extract one from the metadata.
Wouldn't that make sense? What is it used for, then?
What if the document really does not have any title (as opposed to Tika
failing to extract it)? How would you then learn about this, in case it is
important? I am not saying that this was the original reasoning behind this
logic, though. However, you can always pull both fields [title, name] and do
this magic on the client side.
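The client-side fallback suggested above could look roughly like this in
Python. This is only a sketch: display_title is a hypothetical helper, and
it assumes you have already pulled the title and the name out of the search
hit yourself (how you do that depends on your mapping).

```python
from typing import Optional


def display_title(title: Optional[str], name: Optional[str]) -> Optional[str]:
    """Prefer the extracted title; fall back to the file name if it is empty."""
    for candidate in (title, name):
        if candidate and candidate.strip():
            return candidate.strip()
    # Neither value is usable: the caller knows the document has no title.
    return None


print(display_title("Annual Report 2011", "report.pdf"))
print(display_title(None, "report.pdf"))
```

Keeping both values separate in the index and merging them only at display
time preserves the information that a document had no title of its own.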
I tried to configure the attachment mapping by adding a "size" long
field, but it always gets dropped. I had a look at the code and it
seems that the attachment type takes into account existing
configuration of the standard fields but drops the others. Why?
Because it only supports specific "extra" fields associated with the
attachment, and indexing the size is not one of them. You can either add it
as a field yourself to the JSON doc, or we can enhance the attachment
mapper to also index it.
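A minimal Python sketch of the first option, adding the size yourself as a
regular field next to the attachment. The attachment field name "file" is
taken from the "file.content_type" mentioned in this thread; the helper name
and the sample bytes are assumptions made for the example.

```python
import base64
import json


def build_attachment_doc(raw: bytes, name: str) -> dict:
    """Build the JSON document to index, with a sibling "size" field."""
    return {
        "file": {
            "_name": name,  # hint for Tika, per the discussion in this thread
            "content": base64.b64encode(raw).decode("ascii"),
        },
        # Plain top-level field computed on the client, outside the attachment.
        "size": len(raw),
    }


doc = build_attachment_doc(b"%PDF-1.4 example bytes", "report.pdf")
print(json.dumps(doc, indent=2))
```

Because "size" lives outside the attachment object, it is mapped like any
other field and is not subject to the attachment type's fixed field list.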
You need to explicitly configure the content_type mapping and set "store" to
"yes".
I saw in the source that we can hint the document name using
_name, but it doesn't seem to be used as an alternative title for the
document if Tika doesn't manage to extract one from the metadata.
Wouldn't that make sense? What is it used for, then?
Just to enhance Tika's ability to guess the content type.
You need to explicitly configure the content_type mapping and set "store" to
"yes".
Isn't "file.content_type" supposed to be fetched from _source implicitly?
Also, another quick question:
If my understanding is right, the original message will be stored in
_source with the "content" : base64. Will it really store the
base64-encoded file without decoding it first? This would be a HUGE
storage overhead compared to the bare original binary. (I'm okay with
storing the binary in ES as long as it's not stored in base64.)
Yes, elasticsearch does not "touch" the _source and stores it "as is". You
can have the _source compressed though, or completely disable storing the
_source and explicitly store, in the mapping, the fields you want.
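The two options can be sketched as mapping fragments, written here as
Python dicts that serialize to the JSON you would PUT as the mapping. Treat
this as an illustration under assumptions: the type name "doc" is
hypothetical, and the "compress" and "enabled" settings of _source are
written as I understand them for the Elasticsearch versions of that time, so
check them against the version you run.

```python
import json

# Option 1: keep _source, but have it stored compressed.
mapping_compressed = {
    "doc": {  # hypothetical type name
        "_source": {"compress": True},
    }
}

# Option 2: do not store _source at all and store selected fields instead.
mapping_no_source = {
    "doc": {  # hypothetical type name
        "_source": {"enabled": False},
        "properties": {
            "size": {"type": "long", "store": "yes"},
        },
    }
}

print(json.dumps(mapping_compressed, indent=2))
print(json.dumps(mapping_no_source, indent=2))
```

With _source disabled the original base64 payload can no longer be fetched
back, so this only fits if the file itself is kept somewhere else.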
You need to explicitly configure the content_type mapping and set "store" to
"yes".
Isn't "file.content_type" supposed to fetch it from _source implicitly ?
No, _source is the JSON you provided, as is, without any changes. Obviously,
it does not contain the content_type; that's why you need to explicitly store
it in addition to the _source.
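As a sketch of what "explicitly store it" could look like, here is an
attachment mapping with content_type stored, plus a search body that asks
for the stored field. Both are Python dicts that serialize to the JSON
request bodies. The type name "doc" is hypothetical, and the "fields" layout
follows my reading of the attachment plugin's README, so take it as an
assumption rather than the definitive syntax.

```python
import json

mapping = {
    "doc": {  # hypothetical type name
        "properties": {
            "file": {
                "type": "attachment",
                "fields": {
                    # Extracted by Tika, so it is not part of _source;
                    # it must be stored to be returned with a hit.
                    "content_type": {"store": "yes"},
                },
            }
        }
    }
}

search_body = {
    "fields": ["file.content_type"],  # request the stored field explicitly
    "query": {"match_all": {}},
}

print(json.dumps(mapping, indent=2))
print(json.dumps(search_body, indent=2))
```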
Ah sorry, I realized I forgot to show the PUT request I made. Then, one
last question: if I had a content_type in my source JSON, would the
"file.content_type" => "_source.file.content_type" lookup be done
automatically? It doesn't seem so, but I thought I had seen
something like that in the documentation.
If you have any other field in the json, then it will be indexed and you can
search on it, as well as get it back from the _source.