Re: Indexing binary files - how to store the extracted text only

Hello Tom,

Did you find a solution to this problem in the end?

Cheers,
Matt

On Tuesday, 15 January 2013 23:19:07 UTC, tom wrote:

Hello,

Indexing binary files with the mapper attachments plugin works fine.
Unfortunately, it always stores the whole file content, base64 encoded, in
the index, whereas I want to store only the text that Tika extracts from
the file (I fetch the files from my local file system via the fsriver). Is
it possible to configure the attachment plugin appropriately? Or are there
any other plugins or solutions available for that task?

--
View this message in context:
http://elasticsearch-users.115913.n3.nabble.com/Indexing-binary-files-how-to-store-the-extracted-text-only-tp4028232.html
Sent from the Elasticsearch Users mailing list archive at Nabble.com.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Are you really sure it does not work?

From the description at

"By default, 100000 characters are extracted when indexing the content.
This default value can be changed by setting the
index.mapping.attachment.indexed_chars setting. It can also be provided
on a per document indexed using the _indexed_chars parameter. -1 can be
set to extract all text, but note that all the text needs to be allowed
to be represented in memory."

Jörg
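
A minimal sketch of the index-level option named in the quote above. Only the setting name and the meaning of -1 come from the quoted description. The helper name indexed_chars_settings, the node address localhost:9200, the index name "test", and the idea of passing the setting in the index-creation body are assumptions for illustration and have not been checked against a particular plugin version.

```shell
# Hypothetical helper: prints an index-creation body that sets the
# extraction limit quoted above.
indexed_chars_settings() {
  # $1 = number of characters to extract; -1 means "extract all text"
  printf '{ "settings" : { "index.mapping.attachment.indexed_chars" : %s } }\n' "$1"
}

# Assumed usage against a local node (not executed here):
#   curl -X PUT localhost:9200/test -d "$(indexed_chars_settings -1)"
indexed_chars_settings -1
```

Note that this setting only limits how much text is extracted. It does not by itself change whether the base64 string is kept.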


So you do not want to store the base64-encoded string that was sent to
Elasticsearch as the "content" field?
I think that is not possible yet.

The base64-encoded string is indexed any time a JSON document is sent with a
field called "content":

if (name.equals(propName)) {
// that is the content

https://github.com/elasticsearch/elasticsearch-mapper-attachments/blob/master/src/main/java/org/elasticsearch/index/mapper/attachment/AttachmentMapper.java

Please correct me if I am wrong.


Hi Christian,

This is what I'm trying to achieve. With close to 50GB of PDFs, most of
which contain images, I would love to be able to store just the indexed
text rather than the attachment itself. In my ideal world the base64
attachment is sent and processed but not stored in ES. It sounds like this
isn't possible, so I'll have to do the processing before we send the
document to the index.

As an aside, I'm also somewhat confused about whether the file field is
called 'file' or 'content'. 'file' seemed to work for me, but perhaps this
changed in more recent versions of the plugin.

Many Thanks,
Matthew Ford


Sorry for giving the wrong answer. It is possible.
The content that Tika extracts is added to the "_all" field and to
the "file" field. The "_source" field holds the base64 string.
One possibility is to disable the "_source" field for your attachment type:

{
  "person" : {
    "_source" : {"enabled" : false},
    "properties" : {
      "file" : {
        "type" : "attachment",
        "fields" : {
          "file" : {"store" : "yes"}
        }
      }
    }
  }
}



Hey,

How about this:

curl -X DELETE localhost:9200/test
curl -X PUT localhost:9200/test

curl -X PUT localhost:9200/test/person/_mapping -d '
{
  "person" : {
    "_source" : { "excludes" : [ "my_attachment" ] },
    "properties" : {
      "my_attachment" : { "type" : "attachment" }
    }
  }
}'

The base64 string below decodes to "Hello World":

curl -X PUT 'localhost:9200/test/person/1?refresh=true' -d '{
  "name" : "my person",
  "my_attachment" : "SGVsbG8gV29ybGQK"
}'

This query searches in the attachment, but the returned _source contains
only the name field:

curl -X POST 'localhost:9200/test/person/_search' -d '{
  "query" : { "match" : { "my_attachment" : "hello" } }
}'

Hope this helps.

--Alex
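
For anyone wondering where a string like the one above comes from, here is a small sketch that produces it and builds the document body. The helper names encode_attachment and person_document are hypothetical, and it assumes a base64 command-line tool is installed and that the name contains no characters that need JSON escaping.

```shell
# Hypothetical helper: base64-encodes stdin onto a single line.
encode_attachment() {
  # Some base64 implementations wrap long output, so strip the newlines.
  base64 | tr -d '\n'
}

# Hypothetical helper: prints a document body for the "person" type above.
person_document() {
  # $1 = name, $2 = base64 payload
  printf '{ "name" : "%s", "my_attachment" : "%s" }\n' "$1" "$2"
}

# "Hello World" followed by a newline encodes to the string used above.
payload=$(printf 'Hello World\n' | encode_attachment)
person_document "my person" "$payload"
```

For a real file, the payload would come from something like encode_attachment < some.pdf, and the printed body could then be passed to curl with -d.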



Another possibility is to combine the two answers, if you want to
retrieve the extracted text and not only search on it.
Exclude the field of type attachment from _source, but store the
extracted text so that it can be retrieved. Fields are not stored by default.

curl -X PUT localhost:9200/test/person/_mapping -d '
{
  "person": {
    "_source": {
      "excludes": [
        "file"
      ]
    },
    "properties": {
      "file": {
        "type": "attachment",
        "fields": {
          "file": {
            "store": "yes"
          }
        }
      }
    }
  }
}'
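
With a mapping like this, the stored text has to be requested explicitly, because it is no longer part of _source. A rough sketch of such a search body follows. The helper name stored_text_query is hypothetical, and it is an assumption that the stored text is exposed under the field name "file" and that the search request accepts a "fields" list. Both may differ between Elasticsearch and plugin versions.

```shell
# Hypothetical helper: prints a search body that matches on the extracted
# text and asks for the stored "file" field instead of _source.
stored_text_query() {
  # $1 = search term (assumed to contain no JSON special characters)
  printf '{ "fields" : [ "file" ], "query" : { "match" : { "file" : "%s" } } }\n' "$1"
}

# Assumed usage against a local node (not executed here):
#   curl -X POST localhost:9200/test/person/_search -d "$(stored_text_query hello)"
stored_text_query hello
```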
