How to index PDF file data and search the content of attached PDF files

You need to add the ingest attachment plugin.

Click on Settings and Plugins.

Add the plugin. And don't forget to save the changes.

After the cluster has been updated, you will be able to use the Elasticsearch endpoint to call the _simulate API. See the "Simulate pipeline API" page in the Elasticsearch Reference (7.11).

If you combine that with the ingest attachment plugin documentation, you should be able to run something like:

# Create the pipeline
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}

# Use the simulate endpoint

POST /_ingest/pipeline/attachment/_simulate
{
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
          "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
      }
    }
  ]
}

This returns the processed document, with the extracted text under attachment.content. Use that content to build your own JSON document and send it to App Search.
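As a rough sketch of that last step, the Python below pulls the extracted text out of a simulate response and reshapes it into a flat JSON document. The `sample_response` dict is an abridged, illustrative response (a real one carries more fields), and the `body` and `content_type` field names and the `my-pdf-1` id are hypothetical; pick whatever schema suits your App Search engine.

```python
import json


def to_app_search_doc(simulate_response, doc_id):
    """Build a flat document from the first doc in a _simulate response."""
    source = simulate_response["docs"][0]["doc"]["_source"]
    # The attachment processor writes its output under the "attachment" key.
    attachment = source.get("attachment", {})
    return {
        "id": doc_id,  # hypothetical id chosen by the caller
        "body": attachment.get("content", ""),  # hypothetical field name
        "content_type": attachment.get("content_type", ""),  # hypothetical field name
    }


# Abridged, illustrative response; not copied from a real cluster.
sample_response = {
    "docs": [
        {
            "doc": {
                "_index": "index",
                "_id": "id",
                "_source": {
                    "data": "e1xydGYx...",
                    "attachment": {
                        "content_type": "application/rtf",
                        "content": "Lorem ipsum dolor sit amet",
                    },
                },
            }
        }
    ]
}

app_search_doc = to_app_search_doc(sample_response, "my-pdf-1")
print(json.dumps(app_search_doc, indent=2))
```

The resulting dict is what you would serialize and send to the App Search documents endpoint with your own client or HTTP call.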

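Note that e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0= is the Base64 encoding of a file's binary content (here, a tiny RTF document containing "Lorem ipsum dolor sit amet"). For a PDF, Base64-encode the bytes of the PDF file in the same way.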
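If it helps, here is a minimal Python sketch of producing that Base64 string and wrapping it in a simulate request body. The helper name `build_simulate_body` is made up for illustration, and the sample bytes are the RTF content that the string above decodes to; for a real PDF you would read the file's bytes instead (for example with `open(path, "rb").read()`).

```python
import base64
import json


def build_simulate_body(raw_bytes, index="index", doc_id="id"):
    """Base64-encode raw file bytes and wrap them in a _simulate request body."""
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return {
        "docs": [
            {
                "_index": index,
                "_id": doc_id,
                # "data" matches the "field" configured in the attachment processor.
                "_source": {"data": encoded},
            }
        ]
    }


# The RTF bytes behind the example string; replace with your PDF's bytes.
sample_bytes = b"{\\rtf1\\ansi\r\nLorem ipsum dolor sit amet\r\n\\par }"

simulate_body = build_simulate_body(sample_bytes)
print(json.dumps(simulate_body, indent=2))
```

The printed JSON can be pasted as the body of the POST /_ingest/pipeline/attachment/_simulate request.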
