PDF Search

ai_thoughts · September 8, 2018, 2:27pm

I am looking into hand-rolling search over large PDF documents with Elasticsearch. The plan is to parse the PDFs with Apache Tika and then index the extracted text in Elasticsearch. The question is: if I have to locate specific sections within a PDF, how would I go about it? My thinking is that I would need to break the PDF down into multiple sections before indexing. I'd appreciate any pointers, including any plugins that are available.

loren · September 8, 2018, 2:55pm

There's this plugin that will attempt to extract content, title, name, author, keywords, date, content_type, content_length, and language.
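
For the section-level lookup asked about above, one possible approach is to index each page (or section) as its own document, so a hit points back to a location in the PDF. The following is only a minimal sketch under stated assumptions: the page texts are assumed to have been extracted already (for example with Tika), and the function names, field names (`file`, `page`, `content`), and index name `pdf_sections` are hypothetical. It only builds the newline-delimited JSON body that Elasticsearch's `_bulk` endpoint accepts; it does not send anything.

```python
import json


def page_documents(pdf_name, pages):
    """Turn a list of per-page texts into one record per non-empty page.

    `pages` is assumed to come from an earlier extraction step;
    page numbers start at 1 and empty pages are skipped.
    """
    docs = []
    for number, text in enumerate(pages, start=1):
        text = text.strip()
        if not text:
            continue
        docs.append({"file": pdf_name, "page": number, "content": text})
    return docs


def bulk_body(index_name, docs):
    """Build a newline-delimited JSON body for the _bulk endpoint.

    Each record becomes two lines: an action line and the source line.
    The _id scheme (file name plus page number) is an assumption.
    """
    lines = []
    for doc in docs:
        action = {
            "index": {
                "_index": index_name,
                "_id": f"{doc['file']}-{doc['page']}",
            }
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical sample input: three pages, the second one empty.
pages = ["Introduction text", "", "Results text"]
docs = page_documents("report.pdf", pages)
body = bulk_body("pdf_sections", docs)
```

A search hit on `content` then carries the `file` and `page` fields, which is enough to point a reader to the right place in the original PDF. Splitting by heading instead of by page would follow the same pattern with a different splitter.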

