I have a WordPress website and replaced the native search with Elasticsearch using the ElasticPress plugin.
Everything is working perfectly, but now we want to index the contents of binary files (especially PDFs). For testing I'm using Kibana, and everything explained in the documentation works well.
I have read all the documentation and discussions I could find about Ingest Attachment, but I was not able to find out how I am supposed to pass the PDF file itself.
All the examples I found use a "data" field and pass base64-encoded content:
PUT my_index/my_type/my_id?pipeline=attachment
{
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
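To make the question concrete: as far as I can tell, the "data" value is simply the base64 encoding of the raw file bytes, so the client has to read and encode the PDF before indexing. Here is a minimal sketch of that step in Python, assuming the file is readable locally. The function name build_attachment_body is hypothetical and only for illustration.

```python
import base64
import json
from pathlib import Path


def build_attachment_body(pdf_path):
    """Return a JSON request body with the file base64-encoded in "data".

    Hypothetical helper: it only prepares the body, it does not talk
    to Elasticsearch.
    """
    raw_bytes = Path(pdf_path).read_bytes()                 # read the PDF as binary
    encoded = base64.b64encode(raw_bytes).decode("ascii")   # base64 text, safe inside JSON
    return json.dumps({"data": encoded})
```

The returned string would then be sent as the body of the same PUT my_index/my_type/my_id?pipeline=attachment request shown above, using whichever HTTP client is at hand.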
I checked the Ingest Attachment plugin itself; it comes with Tika preinstalled and is supposed to extract the file content.
I also read this one from Taylor Lovett, the creator of ElasticPress. It is an interesting topic.
Could someone please give me a clearer example? Also, do I really need to use Ingest, or should I just pre-parse the file contents and then index them?
Yes, I know I must extract the content and index the binaries. So, from what I've gathered, it isn't as simple as giving a file path to Ingest (after creating the pipeline and mapping, of course) and having Ingest do the extraction?
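For context, the pipeline I mean is along the lines of the one in the Ingest Attachment documentation. Here is a rough sketch of that request, written in Python only to make the body explicit. The pipeline name attachment matches the ?pipeline=attachment parameter used earlier; the helper name pipeline_request is hypothetical.

```python
import json

# Pipeline body following the Ingest Attachment documentation example.
# "field" names the document field that holds the base64-encoded file.
ATTACHMENT_PIPELINE = {
    "description": "Extract attachment information",
    "processors": [
        {"attachment": {"field": "data"}}
    ],
}


def pipeline_request():
    """Return (method, path, body) for creating the pipeline.

    Hypothetical helper: it only describes the request, it does not send it.
    """
    return ("PUT", "_ingest/pipeline/attachment", json.dumps(ATTACHMENT_PIPELINE))
```

Note that the processor only reads the "data" field of the incoming document, so nothing in this definition refers to a file path.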