How to index PDF file data and search data from attachment PDF file

dadoonet · February 26, 2021, 10:31am

So you need to add the ingest attachment plugin (ingest-attachment) to your cluster.

Click on Settings and Plugins, add the plugin, and don't forget to save the changes.

After the cluster has been updated, you will be able to use the Elasticsearch endpoint to call the _simulate API. See the "Simulate pipeline API" page in the Elasticsearch Reference [7.11].

If you combine that with the plugin documentation, you should be able to run something like:

# Create the pipeline
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}

# Use the simulate endpoint

POST /_ingest/pipeline/attachment/_simulate
{
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
      }
    }
  ]
}

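This returns the document as it would look after the pipeline has run, with the extracted text and metadata under the attachment field. Use that content to build your own JSON document and send it to App Search.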
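
As a rough sketch of that last step, the snippet below pulls the extracted text out of a _simulate response and reshapes it into a flat JSON document. The response literal is only an illustration of the shape the API returns (it is not captured output), and the target field names id, body and content_type are hypothetical; use whatever your engine schema expects.

```python
import json


def to_app_search_doc(simulate_response: dict) -> dict:
    """Flatten the first simulated document into a simple JSON document."""
    doc = simulate_response["docs"][0]["doc"]
    attachment = doc["_source"]["attachment"]
    return {
        "id": doc["_id"],
        # Field names below are assumptions, not something App Search requires.
        "body": attachment.get("content", ""),
        "content_type": attachment.get("content_type", ""),
    }


# Illustrative response, trimmed to the fields used above.
sample_response = {
    "docs": [
        {
            "doc": {
                "_index": "index",
                "_id": "id",
                "_source": {
                    "data": "e1xydGYx...",
                    "attachment": {
                        "content_type": "application/rtf",
                        "content": "Lorem ipsum dolor sit amet",
                    },
                },
            }
        }
    ]
}

app_search_doc = to_app_search_doc(sample_response)
print(json.dumps([app_search_doc], indent=2))
```

The printed array is the kind of payload you would then send to your engine's documents endpoint. The Base64 data field is deliberately left out, since only the extracted text is useful for search.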

Note that e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0= is the Base64 encoding of a binary file (here, a small RTF text file).
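
To produce that value for your own file, something along these lines should work. It is a minimal sketch using only the Python standard library; the helper name build_simulate_body is made up for illustration, and the index and id defaults simply mirror the request above.

```python
import base64
import json


def build_simulate_body(raw: bytes, index: str = "index", doc_id: str = "id") -> dict:
    """Wrap raw file bytes in the body shape used by the _simulate request."""
    encoded = base64.b64encode(raw).decode("ascii")
    return {"docs": [{"_index": index, "_id": doc_id, "_source": {"data": encoded}}]}


# Decode the sample from this thread, then encode it again to show the round trip.
sample = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
raw = base64.b64decode(sample)
body = build_simulate_body(raw)
print(json.dumps(body, indent=2))
```

For a real PDF you would read the bytes from disk instead, for example with pathlib.Path("my-file.pdf").read_bytes() (hypothetical file name), and pass them to the same helper.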
