Mapper-attachment and base64 encoding

Hello,

I am trying to index a large amount of data with the mapper-attachment plugin so
I can get at the contents of the documents. My plan of attack is to write a PHP
script that encodes all my files to base64, then index one document to see if it
works. My question is about getting these documents into Elasticsearch in bulk:
since I have a large amount of data, indexing each document with a separate curl
call wouldn't be practical. Does the bulk API support the mapper-attachment
plugin? And does anyone have experience running the mapper-attachment plugin
successfully on their Elasticsearch instance?
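Concretely, the script I have in mind would do something like this (an untested sketch; the index name, type, and "file" field are placeholders for whatever the mapper-attachment mapping ends up being):

<?php
// Untested sketch: base64-encode one file and index it, assuming a
// local node on port 9200 and an index "docs" whose "file" field is
// mapped with type "attachment" (mapper-attachment plugin).
$path = '/data/docs/report.pdf';  // placeholder path
$doc  = json_encode([
    'file' => base64_encode(file_get_contents($path)),
]);

$ch = curl_init('http://localhost:9200/docs/doc/1');
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'PUT');
curl_setopt($ch, CURLOPT_POSTFIELDS, $doc);
curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/json']);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
echo curl_exec($ch);
curl_close($ch);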

Thanks,
Austin


It is definitely better to extract the text externally instead of moving large base64-encoded documents over the wire, so I wouldn't do it the way you plan to: it will consume a lot of memory and bandwidth.
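For example, you could run Tika on each file externally and index only the plain text (a sketch only; it assumes the tika-app jar is available locally, and the index and field names are made up):

<?php
// Sketch: extract text with the Tika CLI instead of shipping base64,
// then index the plain text. Paths and names are placeholders.
$path = '/data/docs/report.pdf';
$text = shell_exec('java -jar tika-app.jar --text ' . escapeshellarg($path));

$doc = json_encode(['path' => $path, 'content' => $text]);

$ch = curl_init('http://localhost:9200/docs/doc');
curl_setopt($ch, CURLOPT_POSTFIELDS, $doc);  // POSTFIELDS implies POST
curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/json']);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
echo curl_exec($ch);
curl_close($ch);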

We have some plans to add this to Logstash in the future, but it's not there yet.

Now, to answer your other questions.

Yes, the bulk API supports whatever JSON documents you send. If you have defined a mapping using mapper-attachment, then Tika will be used to decode your base64 content whether it comes from the bulk API or from the index API.
In that case, you need to think about sending smaller bulk requests (reduce the number of documents per bulk request).
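For example, a bulk request with base64-encoded attachments could be built like this (a sketch, assuming an index "docs" whose "file" field is already mapped as type "attachment"; keep the batch small since each document is large):

<?php
// Sketch: index a small batch of base64-encoded files in one _bulk
// call. Index/type/field names are examples only.
$files = ['/data/a.pdf', '/data/b.pdf'];  // placeholder paths
$body  = '';
foreach ($files as $i => $path) {
    $body .= json_encode(['index' => ['_index' => 'docs', '_type' => 'doc', '_id' => $i + 1]]) . "\n";
    $body .= json_encode(['file' => base64_encode(file_get_contents($path))]) . "\n";
}

$ch = curl_init('http://localhost:9200/_bulk');
curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/json']);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
echo curl_exec($ch);
curl_close($ch);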

Best

--
David Pilato - Developer | Evangelist

@dadoonet (https://twitter.com/dadoonet) | @elasticsearchfr (https://twitter.com/elasticsearchfr) | @scrutmydocs (https://twitter.com/scrutmydocs)


David,

I saw your post https://github.com/dadoonet/fscrawler and have a couple of questions:

  1. fscrawler will scan the directory and index the documents in ES, but where does the base64 encoding happen? The mapper-attachment plugin requires all input to be base64-encoded, correct?

  2. Any plans to add this feature directly to Logstash, i.e. read any PDF/DOC file and convert it to base64 or JSON so that ES can index it?

thanks,
Meenal

You should definitely open a new thread instead of adding questions to a very old thread.

  1. fscrawler does not use the mapper-attachments plugin. It sends already extracted text and metadata to elasticsearch.

  2. Have a look at what is happening here: https://github.com/elastic/elasticsearch/pull/16490
    I tried to build a codec-tika for logstash myself, but I don't think it will make sense now.

I assume that something like FileBeat + ingest-attachment could replace the fscrawler project at some point.
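To give an idea, once the ingest-attachment processor lands, using it might look something like this (a sketch based on the work linked above; the pipeline name and "data" field are just examples):

<?php
// Sketch: register an ingest pipeline whose attachment processor
// decodes a base64 "data" field server-side. Names are examples.
$pipeline = json_encode([
    'description' => 'Extract text from base64-encoded attachments',
    'processors'  => [
        ['attachment' => ['field' => 'data']],
    ],
]);

$ch = curl_init('http://localhost:9200/_ingest/pipeline/attachment');
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'PUT');
curl_setopt($ch, CURLOPT_POSTFIELDS, $pipeline);
curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/json']);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
echo curl_exec($ch);
curl_close($ch);

// Documents would then be indexed with ?pipeline=attachment so the
// node extracts the text before storing the document.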