How to index the PDF documents

How can I index PDF and image documents into Elasticsearch? I would like to extract entities to enable keyword search. Does Workplace Search provide this functionality? Is Apache Tika, or the NLP modules, used within Elasticsearch to accomplish this?

Primarily, I would like to index a few thousand PDF/image documents from:

  1. Local file system (Windows/Linux)
  2. AWS S3 bucket

You also asked in Indexing PDF and Image documents. Let's keep the discussion in one single place.

You can use the ingest attachment plugin.

There's an example here: https://www.elastic.co/guide/en/elasticsearch/plugins/current/using-ingest-attachment.html

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my_index/_doc/my_id

The data field is basically the BASE64 representation of your binary file.
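As a minimal sketch of how that BASE64 value could be produced, assuming Python is available on the client side: the helper name `build_attachment_doc` is hypothetical and only illustrates the encoding step. Sending the body to Elasticsearch is left out.

```python
import base64
import json


def build_attachment_doc(raw: bytes) -> dict:
    """Return a document body whose `data` field holds the bytes as BASE64.

    `build_attachment_doc` is a hypothetical helper name, not part of any
    Elasticsearch client library.
    """
    encoded = base64.b64encode(raw).decode("ascii")
    return {"data": encoded}


def to_request_body(raw: bytes) -> str:
    """Serialize the document as the JSON body of the PUT request above."""
    return json.dumps(build_attachment_doc(raw))


# Example usage (the file name is an assumption):
#   from pathlib import Path
#   body = to_request_body(Path("sample.pdf").read_bytes())
#   # then send `body` with: PUT my_index/_doc/my_id?pipeline=attachment
```

The resulting JSON string is what goes into the body of the `PUT my_index/_doc/my_id?pipeline=attachment` request.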

Alternatively, you can use FSCrawler. There's a tutorial to help you get started.


Thanks David for your quick response.

I have seen both of these options. FSCrawler looks to be the best option. It can feed Workplace Search as well, which provides a nice search UI with facets.
If I want to use Workplace Search with OneDrive or SharePoint Online as the source, can the same functionality be achieved?

If we want to use the ingest attachment plugin, how do we feed the documents (PDF/image) in bulk?
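One possible approach, not confirmed in this thread, is to build an NDJSON body for the `_bulk` API and send it with the `pipeline` query parameter set to the `attachment` pipeline shown earlier. The sketch below only builds the body; the function names, the `my_index` default, and the use of the file name as `_id` are assumptions for illustration.

```python
import base64
import json
from pathlib import Path
from typing import Iterable, Iterator, Tuple


def iter_pdfs(folder: Path) -> Iterator[Tuple[str, bytes]]:
    """Yield (doc_id, raw_bytes) for every PDF under `folder`.

    Using the file name as the id is an assumption; files with the same
    name in different subfolders would overwrite each other.
    """
    for path in sorted(folder.rglob("*.pdf")):
        yield path.name, path.read_bytes()


def build_bulk_body(files: Iterable[Tuple[str, bytes]],
                    index: str = "my_index") -> str:
    """Build an NDJSON body: one action line plus one source line per file."""
    lines = []
    for doc_id, raw in files:
        # Action line: index this document under the given id.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        # Source line: the BASE64 payload read by the attachment processor.
        lines.append(json.dumps({"data": base64.b64encode(raw).decode("ascii")}))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)


# Example usage (folder name is an assumption):
#   body = build_bulk_body(iter_pdfs(Path("./docs")))
#   # then send `body` with: POST _bulk?pipeline=attachment
#   # and the header Content-Type: application/x-ndjson
```

BASE64 inflates each file by roughly a third, so for a few thousand documents it is probably safer to send several small batches than one large request.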

I am also looking into whether the NLP ML models available in Elasticsearch can help tag these documents with relevant tags. That way, searches can be run against the exact values of the identified tags.

Have a look at Connecting SharePoint Online | Workplace Search Guide [8.6] | Elastic and Connecting OneDrive | Workplace Search Guide [8.6] | Elastic

Sure. I have checked those documents but could not find the list of supported file formats for those data sources. So are PDF and image file formats supported by default?

I believe so. :blush:

Sure. Thanks.

Can you please provide your input on the following question?

I am also looking into whether the NLP ML models available in Elasticsearch can help tag these documents with relevant tags. That way, searches can be run against the exact values of the identified tags.

I have not played with NLP yet. But I'd check: Overview | Machine Learning in the Elastic Stack [8.6] | Elastic
