I am trying to index a large amount of data with the mapper-attachment plugin so I can get at the contents of the documents. My plan of attack is to write a PHP script that encodes all my files to base64, then index a single document to see if it works. My question is about getting these documents into Elasticsearch in bulk: as I said, I have a large amount of data, so sending each document in its own curl request wouldn't be practical. Does the bulk API support the mapper-attachment plugin? And does anyone have experience with the mapper-attachment plugin working well with their Elasticsearch instance?
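Roughly, something like this is what I have in mind (the index, type and field names below are just placeholders I picked, and the path is made up):

```php
<?php
// Sketch only: "docs", "doc" and "file" are placeholder names, and the path is made up.
// Read every file, BASE64-encode it into the attachment field, and build one
// newline-delimited _bulk body.
$bulkBody = '';
foreach (glob('/path/to/files/*') as $path) {
    $bulkBody .= json_encode(['index' => ['_index' => 'docs', '_type' => 'doc']]) . "\n";
    $bulkBody .= json_encode(['file' => base64_encode(file_get_contents($path))]) . "\n";
}

// Send the whole batch in a single request.
$ch = curl_init('http://localhost:9200/_bulk');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $bulkBody);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
echo curl_exec($ch);
curl_close($ch);
```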
It's definitely better to do the extraction externally (which is what fscrawler does) instead of moving large BASE64-encoded documents over the wire.
So I would not do it the way you are planning to: it will consume a lot of memory and bandwidth. We have plans to add this to Logstash in the future, but it's not there yet.
Now, answering some other questions.
Yes, bulk supports whatever JSON document you send. If you have defined a mapping using the mapper attachment, then Tika will be used to decode your BASE64 content, whether it comes from the bulk API or from the index API.
In that case, you need to think about sending smaller bulk requests (lower the number of documents per bulk request).
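To make that concrete, here is a rough sketch in PHP (the index, type and field names are only examples, adjust to your own setup): define a mapping with an attachment field so Tika knows which field to decode, then split your file list into small batches and send one bulk request per batch.

```php
<?php
// Sketch only: "docs", "doc" and "file" are example names, not anything official.
// Define a mapping with an attachment field; the plugin (Tika) will decode the
// BASE64 content of that field, whether it arrives via _bulk or the index API.
$mapping = json_encode([
    'mappings' => [
        'doc' => [
            'properties' => [
                'file' => ['type' => 'attachment'],
            ],
        ],
    ],
]);

$ch = curl_init('http://localhost:9200/docs');
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'PUT');
curl_setopt($ch, CURLOPT_POSTFIELDS, $mapping);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
echo curl_exec($ch);
curl_close($ch);

// When indexing, keep each bulk request small: split the file list with
// something like array_chunk($files, 10) and send one _bulk request per batch
// instead of one huge request containing every BASE64-encoded document.
```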
fscrawler will scan the directory and index the documents into ES, but where does the base64 encoding happen? The mapper-attachment plugin requires all input to be base64-encoded, correct?
Any plans to add this feature directly to Logstash, i.e. read any PDF/DOC file and convert it to base64 or JSON so that ES can parse it?