Mapping for indexed attachments without storing them


(Pavel Hloušek) #1

Hello,

I'm new to Elasticsearch and I'm currently experimenting with the
mapper-attachments plugin. My idea is to use ES to index file contents but
NOT to store the file contents. Is that possible? What would the mapping be?
I cannot get it working.

Thanks for your help

Pavel

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/d72c5f4b-bc21-4e95-9042-00cde1535789%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Pavel Hloušek) #2

I forgot to mention that I used the latest releases of both ES and the
mapper-attachments plugin.

(In reply to Pavel Hloušek's message of Friday, 7 March 2014, 16:37:51 UTC+1.)


(David Pilato) #3

The best approach is not to use mapper attachment at all, and instead to extract the content before indexing, for example with Apache Tika, which mapper attachment itself uses.
That's what I did in the FSRiver project. At first I used mapper attachment, but dropping it gives you more flexibility and more control over what you send over the wire.

That said, you can use exclude to remove some content from _source.
See: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-source-field.html#include-exclude
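A minimal sketch of what such a mapping could look like, built as a Python dict so it can be serialized and sent with any client. The type name "document" and the field name "file" are placeholders chosen for illustration and do not come from this thread. The sketch assumes the `excludes` list of the `_source` mapping described in the linked page and the `attachment` field type provided by the mapper-attachments plugin.

```python
import json


def build_mapping(doc_type="document", field="file"):
    """Return a mapping that indexes an attachment field but drops it from _source.

    doc_type and field are hypothetical names used only for this example.
    """
    return {
        doc_type: {
            # Keep the base64 payload out of the stored _source document.
            "_source": {"excludes": [field]},
            "properties": {
                # Field type provided by the mapper-attachments plugin.
                field: {"type": "attachment"},
            },
        }
    }


mapping = build_mapping()
print(json.dumps(mapping, indent=2))
```

The printed JSON would be the body of a put-mapping request. The extracted text should still be searchable, but the original base64 can no longer be read back from _source.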

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr

(In reply to Pavel Hloušek's message of 7 March 2014, 16:59:23.)


(Pavel Hloušek) #4

Here is a gist that reproduces my steps:



(Pavel Hloušek) #5

Thanks for your answer. I get your point. However, for a PHP project it
seemed easier to use mapper attachment. Otherwise I would need to keep an
Apache Tika server running, or run Tika from the CLI each time, which is
rather expensive.

Still, exclude is nice, but I'd like to save space on the servers: there can
be a lot of content.

Pavel

(In reply to David Pilato's message of Friday, 7 March 2014, 17:03:00 UTC+1.)


(David Pilato) #6

What's wrong with exclude?
It does the job, right?

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

(In reply to Pavel Hloušek's message of 7 March 2014, 17:18.)


(Vorou) #7

Sorry to resurrect an old topic.

Exclude did the job, thank you. But could you please elaborate on why the plugin stores the base64 by default? I guess you'd need it only if something in the Tika extraction process changed. Are there any other reasons why I shouldn't disable it?


(David Pilato) #8

It was by design. Note that this will change in 5.0.0 with the ingest-attachment plugin: https://www.elastic.co/guide/en/elasticsearch/plugins/master/ingest-attachment.html
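As a rough sketch of what that could look like from 5.0.0 onward, the pipeline definition below extracts the attachment and then drops the base64 field, so nothing needs to be excluded from _source afterwards. It assumes the `attachment` processor from the ingest-attachment plugin and the built-in `remove` processor. The field name "data" and the description text are placeholders for illustration, not something stated in this thread.

```python
import json

# Hypothetical name of the field that carries the base64-encoded file.
SOURCE_FIELD = "data"

pipeline = {
    "description": "Extract attachment text, then drop the base64 payload",
    "processors": [
        # Processor provided by the ingest-attachment plugin.
        {"attachment": {"field": SOURCE_FIELD}},
        # Remove the original base64 so it is neither stored nor returned.
        {"remove": {"field": SOURCE_FIELD}},
    ],
}

print(json.dumps(pipeline, indent=2))
```

The printed JSON would be the body used to register the pipeline. Documents indexed through it would then keep only the extracted content.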

