Did you find a solution to this problem in the end?
Cheers,
Matt
On Tuesday, 15 January 2013 23:19:07 UTC, tom wrote:
Hello,
Indexing binary files with the mapper attachments plugin works very well.
Unfortunately, it always stores the complete file content, base64 encoded, in
the index, whereas I want to store only the text that Tika extracts from
the file (I fetch the files from my local file system via the fsriver). Is
it possible to configure the attachment plugin appropriately? Or are there
any other plugins/solutions available for that task?
"By default, 100000 characters are extracted when indexing the content.
This default value can be changed by setting the
index.mapping.attachment.indexed_chars setting. It can also be provided
on a per document indexed using the _indexed_chars parameter. -1 can be
set to extract all text, but note that all the text needs to be allowed
to be represented in memory."
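For illustration, the two places where that limit can be set might look like the sketch below. It only builds the JSON bodies and does not talk to a cluster. The helper names, the "file" field name and the "content" key are assumptions for the example; the key that carries the base64 data may differ between plugin versions.

```python
import base64
import json


def index_settings(indexed_chars=-1):
    """Index-level setting quoted above; -1 asks for all text to be extracted."""
    return {"settings": {"index.mapping.attachment.indexed_chars": indexed_chars}}


def attachment_document(raw_bytes, field="file", content_key="content",
                        indexed_chars=None):
    """Build a document body holding a base64 encoded attachment.

    `field` and `content_key` are hypothetical defaults, not confirmed
    by this thread; adjust them to match your mapping and plugin version.
    """
    attachment = {content_key: base64.b64encode(raw_bytes).decode("ascii")}
    if indexed_chars is not None:
        # Per-document override of the extraction limit.
        attachment["_indexed_chars"] = indexed_chars
    return {field: attachment}


if __name__ == "__main__":
    print(json.dumps(index_settings(-1), indent=2))
    print(json.dumps(attachment_document(b"hello", indexed_chars=10), indent=2))
```

Note that this setting only limits how much text is extracted and indexed; it does not change whether the base64 payload itself is kept.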
--
View this message in context:
http://elasticsearch-users.115913.n3.nabble.com/Indexing-binary-files-how-to-store-the-extracted-text-only-tp4028232.html
Sent from the ElasticSearch Users mailing list archive at Nabble.com.
This is what I'm trying to achieve. With close to 50GB of PDFs, most of
which contain images, I would love to be able to store just the indexed
text rather than the attachment itself. In my ideal world the base64
attachment is sent and processed but not stored in ES. It sounds like this isn't
possible and I'll have to extract the text before we send the document to the index.
As an aside, I'm also somewhat confused about whether the file field is called 'file'
or 'content'. 'file' seemed to work for me, but perhaps this changed in more
recent versions of the plugin.
Many Thanks,
Matthew Ford
On Tuesday, 16 July 2013 14:36:40 UTC+1, Christian Th. wrote:
You do not want to store the base64 encoded String, which was sent to
Elasticsearch as the "content" field?
I think it is not possible yet.
The base64 encoded String is indexed any time a JSON document is sent with
a field called "content":
if (name.equals(propName)) {
// that is the content
Sorry for giving the wrong answer. It is possible.
The extracted content from Tika is added to the "_all" field and added to
the "file" field. The "_source" field is filled with the base64 String.
One possibility is to disable the "_source" field for your attachment type.
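A minimal sketch of such a mapping, written as a Python dict that can be serialised and sent to the put-mapping API. The type name "document" and the field name "file" are placeholders for this example, and the helper function is hypothetical, not part of any client library.

```python
import json


def attachment_mapping(type_name="document", field="file"):
    """Mapping with _source disabled for a type that has an attachment field.

    `type_name` and `field` are hypothetical names chosen for illustration.
    """
    return {
        type_name: {
            # Without _source, the base64 String sent at index time is not kept.
            "_source": {"enabled": False},
            "properties": {
                field: {"type": "attachment"},
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(attachment_mapping(), indent=2))
```

With _source disabled the original JSON document cannot be fetched back, so this fits the case where the index is only searched and the files remain available elsewhere.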
Another possibility is to combine the two answers, if you want to
store the field and not only search on it.
Disable _source for your attachment type, but store the
extracted text so it can be retrieved. Fields are not stored by default.
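The combined approach could be sketched as follows, again as plain dicts that only build request bodies. The type name, the field name and the name of the sub-field holding the extracted text are assumptions; in the plugin versions I have seen, that sub-field shared the name of the attachment field, but this may differ in yours.

```python
import json


def stored_text_mapping(type_name="document", field="file", text_subfield="file"):
    """Disable _source but store the text extracted from the attachment.

    All three names are hypothetical defaults for illustration.
    """
    return {
        type_name: {
            "_source": {"enabled": False},
            "properties": {
                field: {
                    "type": "attachment",
                    "fields": {
                        # Store the extracted text so it can be returned in hits.
                        text_subfield: {"store": "yes"},
                    },
                }
            },
        }
    }


def stored_text_query(query_text, field="file"):
    """Search body that asks for the stored field instead of _source."""
    return {
        "fields": [field],
        "query": {"match": {field: query_text}},
    }


if __name__ == "__main__":
    print(json.dumps(stored_text_mapping(), indent=2))
    print(json.dumps(stored_text_query("invoice"), indent=2))
```

The trade-off is index size: the extracted text is kept twice (indexed and stored), but for image-heavy files that should still be far smaller than the base64 original.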