PDF Search

I am looking into hand-rolling search over large PDF documents with Elasticsearch. The plan is to parse the PDFs with Apache Tika and then index the output in Elasticsearch. My question: if I need to locate specific sections within a PDF, how would I go about it? My thinking is that I would need to break the PDF down into multiple sections before indexing. I'd appreciate any pointers, including any plugins that are available.

There's this plugin that will attempt to extract content, title, name, author, keywords, date, content_type, content_length, and language.
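On the "break it into sections before indexing" idea: if page-level location is good enough, one option is to index each page as its own document. The sketch below is only an illustration under a few assumptions: that you request Tika's XHTML output and that it wraps each PDF page in a `<div class="page">` element (worth verifying on your own files), and that a per-page document with `doc_id`, `page`, and `content` fields suits your mapping. The names `PageSplitter`, `to_bulk_lines`, and the `pdf_pages` index are made up for this example. It uses only the Python standard library and produces a newline-delimited body in the shape the Elasticsearch bulk API expects.

```python
import json
from html.parser import HTMLParser


class PageSplitter(HTMLParser):
    """Collect the text inside each <div class="page"> element.

    Assumption: the XHTML wraps every PDF page in such a div.
    """

    def __init__(self):
        super().__init__()
        self.pages = []    # finished page texts, in document order
        self._depth = 0    # div nesting depth inside the current page (0 = outside)
        self._chunks = []  # text fragments of the page being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth > 0:
            # A div nested inside a page: track it so we find the right close.
            self._depth += 1
        elif "page" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag != "div" or self._depth == 0:
            return
        self._depth -= 1
        if self._depth == 0:
            # Page closed: join fragments and collapse runs of whitespace.
            self.pages.append(" ".join(" ".join(self._chunks).split()))

    def handle_data(self, data):
        if self._depth > 0:
            self._chunks.append(data)


def to_bulk_lines(xhtml, doc_id, index="pdf_pages"):
    """Build a bulk request body with one document per page.

    The index name and field names are hypothetical; adapt them to your mapping.
    """
    parser = PageSplitter()
    parser.feed(xhtml)
    parser.close()
    lines = []
    for number, text in enumerate(parser.pages, start=1):
        # Action line, then the document source, as the bulk format requires.
        lines.append(json.dumps({"index": {"_index": index, "_id": f"{doc_id}-{number}"}}))
        lines.append(json.dumps({"doc_id": doc_id, "page": number, "content": text}))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline
```

A search hit then carries the `page` number, so you can point the user at the right place in the original file. If you need real sections rather than pages, the same pattern could split on heading elements instead, though whether Tika emits usable headings depends heavily on how the PDF was produced.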
