How to specify file to Ingest Attachment

fjosef · February 21, 2017, 9:34am

Hi everyone!

I have a WordPress website and replaced the native search with Elasticsearch using the ElasticPress plugin.

Everything is working perfectly, but now we want to index the contents of binary files (especially PDFs). For testing I'm using Kibana, and everything explained in the documentation works well.

I have read literally all the documentation and discussions about Ingest Attachment and was not able to find out how I must pass the PDF file itself.

All the examples I found use a "data" field and pass base64-encoded text:

PUT my_index/my_type/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
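For readers wondering what that "data" string actually contains, here is a small Python sketch that decodes it. It uses only the standard library; the function name `decode_sample` is made up for illustration. The decoded bytes turn out to be a tiny RTF document, which suggests the field is meant to hold the encoded file itself rather than plain text.

```python
import base64

# The "data" value copied from the example request above.
SAMPLE = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="


def decode_sample(value: str) -> bytes:
    """Return the raw bytes that the base64 string encodes."""
    return base64.b64decode(value)


if __name__ == "__main__":
    # Prints the raw bytes of a small RTF file containing "Lorem ipsum ...".
    print(decode_sample(SAMPLE))
```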

I checked the Ingest Attachment plugin itself; it comes with Tika preinstalled and is supposed to extract the file content.

I also read this one from Taylor Lovett, the creator of ElasticPress. It is an interesting topic.

Could someone please give me a clearer example? Also, do I really need to use Ingest, or should I just pre-parse the file contents and then index them?

warkolm · February 21, 2017, 9:36am

You need to base64-encode the entire PDF before you put it in that JSON format; the encoded string is then inserted as the data value.
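A minimal Python sketch of that step, assuming the file is readable from the machine that talks to Elasticsearch. The function names and the example path are hypothetical; only the "data" field name comes from the request shown earlier in the thread.

```python
import base64
import json
from pathlib import Path


def build_attachment_body(raw: bytes) -> str:
    """Base64-encode the whole file and wrap it in the JSON body."""
    encoded = base64.b64encode(raw).decode("ascii")
    return json.dumps({"data": encoded})


def build_body_from_path(path: str) -> str:
    """Read a file from disk (path is an assumption) and build the body."""
    return build_attachment_body(Path(path).read_bytes())


if __name__ == "__main__":
    # Hypothetical path; replace it with the real location of the PDF.
    print(build_body_from_path("/path/to/file.pdf")[:80])
```

The resulting string is what goes into the body of the PUT request with `?pipeline=attachment`.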

fjosef · February 21, 2017, 9:40am

Thanks for your fast response!
But as you mentioned, we still need to do some heavy lifting. So what really are the benefits of Ingest?

Also, can you give me some example code to do this?

warkolm · February 21, 2017, 9:44am

I have used the base64 command on the shell previously; I haven't really worked with large amounts of binary docs to automate it further, sorry.

It can be used for more than PDFs

dadoonet · February 21, 2017, 9:54am

Have a look at the FSCrawler project. It exposes a REST endpoint where you can simply upload your binary file.

fjosef · February 21, 2017, 9:58am

I don't have an upload problem; they are already uploaded.

Also, if other plugins are needed, then what are the benefits of Ingest?

dadoonet · February 21, 2017, 10:10am

They are uploaded where?

fjosef · February 21, 2017, 10:14am

As I said, it is a WordPress site and the PDF files are attached to the posts using custom fields.
So they are on the server and have a known path.

dadoonet · February 21, 2017, 10:43am

If you want to search for them, you need to index them.
One way or another, you need to send their content to Elasticsearch.

You can extract the content yourself and send only what you want to index to Elasticsearch.
You can send the BASE64-encoded binary to Elasticsearch ingest, which will do the extraction.
You can send the binary to FSCrawler, which will do the extraction before sending it to Elasticsearch.
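The second option can be sketched in Python roughly as follows. This is an illustration under assumptions, not a tested integration: the host `http://localhost:9200` and the function names are made up, the pipeline body follows the minimal form from the Ingest Attachment documentation, and the `my_index/my_type/my_id` path mirrors the request quoted earlier in the thread. The code only builds the two HTTP requests; nothing is sent.

```python
import base64
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local, unsecured node


def pipeline_request(name: str = "attachment", field: str = "data"):
    """Build the PUT request that defines the attachment pipeline."""
    body = {
        "description": "Extract attachment information",
        "processors": [{"attachment": {"field": field}}],
    }
    return urllib.request.Request(
        f"{ES_URL}/_ingest/pipeline/{name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def index_request(raw: bytes, doc_path: str = "my_index/my_type/my_id",
                  pipeline: str = "attachment"):
    """Build the PUT request that pushes one base64-encoded file."""
    body = {"data": base64.b64encode(raw).decode("ascii")}
    return urllib.request.Request(
        f"{ES_URL}/{doc_path}?pipeline={pipeline}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Passing either request object to `urllib.request.urlopen` would actually send it. The client reads the file and pushes it; Elasticsearch only sees the encoded bytes.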

fjosef · February 21, 2017, 10:56am

Yes, I know I must extract the content and index the binaries. So from what I've understood, I can't simply give a file path to Ingest (after creating the pipeline and mapping, of course) and have Ingest do the extraction?

Also, must the binary be BASE64-encoded?

dadoonet · February 21, 2017, 2:34pm

No. Elasticsearch never fetches data from a source. You have to push it.
Note that you can write your own plugin which fetches it if you wish.

system · March 21, 2017, 2:35pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
