How to index the PDF documents

dadoonet · March 18, 2023, 11:32am

You can use the ingest attachment plugin.

There's an example here: https://www.elastic.co/guide/en/elasticsearch/plugins/current/using-ingest-attachment.html

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}

PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}

GET my_index/_doc/my_id

The data field holds the Base64-encoded content of your binary file.
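
If it helps, here is a minimal Python sketch of how that value could be produced. The function name build_attachment_body and the file name mentioned in the comments are hypothetical, chosen only for illustration. The sample bytes are the small RTF snippet that the "data" value in the example above decodes to.

```python
import base64
import json


def build_attachment_body(raw: bytes, field: str = "data") -> str:
    """Return a JSON request body whose `field` holds the Base64 form of `raw`.

    `build_attachment_body` is a hypothetical helper, not part of any
    Elasticsearch client library.
    """
    encoded = base64.b64encode(raw).decode("ascii")
    return json.dumps({field: encoded})


# For a real PDF you would read the bytes from disk instead, for example
# pathlib.Path("my_file.pdf").read_bytes() (hypothetical file name).
sample = b"{\\rtf1\\ansi\r\nLorem ipsum dolor sit amet\r\n\\par }"

body = build_attachment_body(sample)
print(body)
```

The resulting string can be sent as the body of the PUT my_index/_doc/my_id?pipeline=attachment request with whatever HTTP client you prefer.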

You can also use FSCrawler. There's a tutorial to help you get started.
