How to index the PDF documents

mruthyu · March 18, 2023, 10:14am

How can I index PDF and image documents into Elasticsearch? I would like to extract entities so that the documents can be searched by keyword. Does Workplace Search provide this functionality? Is Apache Tika used within Elasticsearch, or are the NLP modules used, to accomplish this?

Primarily, I would like to index a few thousand PDF/image documents from:

Local file system (Windows/Linux)
AWS S3 bucket

dadoonet · March 18, 2023, 11:30am

You also asked in IIndexing PDF and Image documents. Let's keep the discussion in one single place.

dadoonet · March 18, 2023, 11:32am

You can use the ingest attachment plugin.

There is an example here: https://www.elastic.co/guide/en/elasticsearch/plugins/current/using-ingest-attachment.html

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my_index/_doc/my_id

The data field is basically the BASE64 representation of your binary file.
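As a rough illustration of producing that BASE64 value from a file on disk, here is a minimal Python sketch using only the standard library. The helper names (encode_file, build_bulk_body) and the choice of the file name as the document id are assumptions made for this example, not part of the plugin. The second helper assumes the documents are sent through the _bulk endpoint with the pipeline query parameter, which is one way to feed several files in a single request.

```python
import base64
import json
from pathlib import Path


def encode_file(path):
    """Return the file's bytes as a Base64 ASCII string."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


def build_bulk_body(paths, index="my_index"):
    """Build an NDJSON body intended for POST _bulk?pipeline=attachment."""
    lines = []
    for path in paths:
        # Action line: index the document; using the file name as the id
        # is an assumption made for this sketch.
        lines.append(json.dumps({"index": {"_index": index, "_id": Path(path).name}}))
        # Source line: the "data" field read by the attachment processor.
        lines.append(json.dumps({"data": encode_file(path)}))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"
```

The returned string could be sent as the body of POST _bulk?pipeline=attachment with the Content-Type header set to application/x-ndjson. Base64 inflates each file by roughly a third, so for thousands of files it is probably safer to send small batches than one large request.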

You can also use FSCrawler. There's a tutorial to help you get started.

mruthyu · March 18, 2023, 2:50pm

Thanks David for your quick response.

I have seen both of these options. FSCrawler looks to be the best option. It can feed Workplace Search as well, which gives us a nice search UI with facets.
If I want to use Workplace Search with OneDrive or SharePoint Online as the source, can the same functionality be achieved?

If we want to use the ingest attachment plugin, how do we feed the documents (PDF/image) in bulk?

I am also looking into whether the NLP/ML models available in Elasticsearch can help tag these documents with relevant tags. That way, searches could be run against the exact values of the identified tags.

dadoonet · March 19, 2023, 8:37pm

Have a look at Connecting SharePoint Online | Workplace Search Guide [8.6] | Elastic and Connecting OneDrive | Workplace Search Guide [8.6] | Elastic

mruthyu · March 20, 2023, 7:06am

Sure. I have checked those documents, but I am not able to find the list of supported file formats for those data sources. So are PDF and image file formats supported by default?

dadoonet · March 20, 2023, 7:14am

I believe so.

mruthyu · March 20, 2023, 9:28am

Sure. Thanks.

Can you please provide your inputs for the following query?

I am also looking into whether the NLP/ML models available in Elasticsearch can help tag these documents with relevant tags. That way, searches could be run against the exact values of the identified tags.

dadoonet · March 20, 2023, 10:36am

I did not play with NLP yet. But I'd check: Overview | Machine Learning in the Elastic Stack [8.6] | Elastic

system · April 17, 2023, 10:37am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
