How to specify a file for Ingest Attachment

Hi everyone!

I have a WordPress website and replaced the native search with Elasticsearch using the ElasticPress plugin.

Everything is working perfectly, but now we want to index binary file contents (especially PDFs). For testing I'm using Kibana, and everything explained in the documentation works fine.

I've literally read all the documentation and discussions about Ingest Attachment and was not able to find out how I'm supposed to pass the PDF file itself.

All the examples I found use a "data" field and pass base64-encoded text:

PUT my_index/my_type/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
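
For reference, the attachment pipeline referenced above is the one from the docs, more or less:

PUT _ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "data"
      }
    }
  ]
}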

I checked the Ingest Attachment plugin itself; it comes with Tika pre-installed and is supposed to extract file content.

I also read this one from Taylor Lovett, the creator of ElasticPress. It is an interesting topic.

Could someone please give me a clearer example? Also, do I really need to use Ingest, or should I just pre-parse the file contents and then index them?

You need to base64-encode the entire PDF before you put it in that JSON format; it is then inserted as the "data" value.

Thanks for your fast response!
But as you mentioned, we still need to do some heavy lifting ourselves. Then what is the real benefit of Ingest?

Also, can you give me some example code to do this?

I have used the base64 command in the shell previously, but I haven't really worked with large amounts of binary docs to automate it further, sorry.
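
That said, a rough, untested sketch in Python of what it could look like (the file path, index name, and pipeline name are just placeholders taken from your example):

import base64
import requests

ES_URL = "http://localhost:9200"  # adjust to your cluster
DOC_URL = ES_URL + "/my_index/my_type/my_id?pipeline=attachment"

# Read the PDF and base64-encode its raw bytes.
with open("/path/to/document.pdf", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

# Send the encoded file as the "data" field; the attachment
# pipeline extracts the text into attachment.content.
resp = requests.put(DOC_URL, json={"data": encoded})
resp.raise_for_status()
print(resp.json())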

It can be used for more than PDFs :slight_smile:

Have a look at the FSCrawler project. It exposes a REST endpoint where you can simply upload your binary file.

I don't have an upload problem; the files are already uploaded.

Also, if other plugins are needed, then what is the benefit of Ingest?

They are uploaded where?

As I said, it is a WordPress site and the PDF files are attached to posts using custom fields.
So they are on the server and have a known path.

If you want to search for them, you need to index them.
One way or another you need to send their content to elasticsearch.

  • You can extract the content yourself and send only what you want to index to Elasticsearch (see the sketch below).
  • You can send the binary as base64 to the Elasticsearch ingest pipeline, which will do the extraction.
  • You can send the binary to FSCrawler, which will do the extraction before sending it to Elasticsearch.
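
For the first option, a minimal sketch (untested, assuming the tika Python package for local extraction and the index names from your example) could look like this:

import requests
from tika import parser  # pip install tika; runs a local Apache Tika server (needs Java)

# Extract the text from the PDF yourself.
parsed = parser.from_file("/path/to/document.pdf")
content = (parsed.get("content") or "").strip()

# Index only the extracted text; no ingest pipeline needed.
resp = requests.put(
    "http://localhost:9200/my_index/my_type/my_id",
    json={"content": content, "filename": "document.pdf"},
)
resp.raise_for_status()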

Yes, I know I must extract the content and index the binaries. So from what I've gathered, I can't simply give a file path to Ingest (after creating the pipeline and mapping, of course) and have Ingest do the extraction?

Also, must the binary be base64-encoded?

No. Elasticsearch never fetches data from a source. You have to push it.
Note that you can write your own plugin which fetches it if you wish.
