How to index PDF file data and search within PDF attachments

Hello,

I want to index PDF file data in Elastic App Search and search the data from those PDF files.
Could you please advise whether it is possible to handle PDF attachments in Elastic App Search?

Regards

I might add this feature to FSCrawler although I'm unsure how useful this could be.

What is the use case?

Note that FSCrawler supports Workplace Search.

Anyway, what you can do is to use the ingest attachment plugin and the ingest simulate API.

  1. Send the PDF to this API.
  2. Get back the extracted text and the metadata.
  3. Build your JSON document from that output.
  4. Send it to App Search.

Thank you for your support!
Could you please explain how to install the ingest attachment plugin and use the ingest simulate API with Elastic App Search?

I have an Elastic Enterprise Search Standard account.

Are you running on cloud or locally?

I am running on cloud, on AWS Asia Pacific (Tokyo).

So you need to add the ingest attachment plugin.

Click on Settings and then Plugins.

Add the plugin. And don't forget to save the changes.

After the cluster has been updated, you will be able to use the Elasticsearch endpoint to call the _simulate API. See the "Simulate pipeline API" page in the Elasticsearch Reference [7.11].

If you mix that with the plugin documentation, you should be able to execute something like:

# Create the pipeline
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}

# Use the simulate endpoint

POST /_ingest/pipeline/attachment/_simulate
{
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
      }
    }
  ]
}

This will give you an output that contains the extracted text and metadata. Use that content to build your own JSON document and send it to App Search.
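As a rough sketch of that "build your own JSON" step, the Python snippet below pulls the extracted fields out of a _simulate response and flattens them into a document. The helper name `to_app_search_document`, the output field names (`body`, `content_type`, `language`), and the sample response values are assumptions made for illustration; the response layout (`docs[0].doc._source.attachment`) follows what the simulate API and the attachment processor normally return, so check it against your own output.

```python
def to_app_search_document(simulate_response: dict, doc_id: str) -> dict:
    """Flatten the first simulated doc into a simple JSON-ready document."""
    first = simulate_response["docs"][0]
    if "doc" not in first:
        # A failed simulation reports an "error" object instead of "doc".
        raise ValueError(f"simulate failed: {first.get('error')}")
    attachment = first["doc"]["_source"].get("attachment", {})
    return {
        "id": doc_id,
        # Hypothetical field names; pick whatever suits your engine schema.
        "body": attachment.get("content", ""),
        "content_type": attachment.get("content_type", ""),
        "language": attachment.get("language", ""),
    }


# Illustrative response only; real values depend on the file you send.
sample_response = {
    "docs": [
        {
            "doc": {
                "_index": "index",
                "_id": "id",
                "_source": {
                    "data": "...",
                    "attachment": {
                        "content_type": "application/rtf",
                        "language": "ro",
                        "content": "Lorem ipsum dolor sit amet",
                        "content_length": 28,
                    },
                },
            }
        }
    ]
}

document = to_app_search_document(sample_response, "my-pdf-1")
print(document)
```

The resulting dictionary can then be sent, inside a JSON array, to the documents endpoint of your App Search engine with whichever HTTP client you prefer.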

Note that e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0= is the Base64 encoding of a binary file (here, a small RTF text document). For a PDF, you would send the Base64 encoding of the PDF file instead.
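To produce such a Base64 value from your own file, something like the following Python sketch could work. The helper name `build_simulate_body` and the placeholder `_index` and `_id` values are assumptions for illustration and are not part of any Elastic client; for a real PDF you would pass the bytes read from the file opened in binary mode.

```python
import base64
import json


def build_simulate_body(file_bytes: bytes, doc_id: str = "id") -> dict:
    """Wrap raw file bytes in a body for the attachment pipeline's _simulate call."""
    # The attachment processor expects the binary content as a Base64 string.
    encoded = base64.b64encode(file_bytes).decode("ascii")
    return {
        "docs": [
            {
                "_index": "index",  # placeholder; _simulate does not index anything
                "_id": doc_id,
                "_source": {"data": encoded},
            }
        ]
    }


# Sample bytes decoded from the Base64 string used in this thread.
THREAD_SAMPLE = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
sample_bytes = base64.b64decode(THREAD_SAMPLE)
body = build_simulate_body(sample_bytes)
print(json.dumps(body, indent=2))
```

The printed JSON is what would be sent as the body of POST /_ingest/pipeline/attachment/_simulate.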


Thank you for the update! Let me try this.
