I am looking into hand-rolling search over large PDF documents with Elasticsearch. The plan is to use Apache Tika for parsing and then index the extracted text in Elasticsearch. The question is: if I need to locate specific sections within a PDF, how would I go about it? My thinking is that I would need to break the PDF down into multiple sections before indexing. I'd appreciate any pointers, including whether any plugins are available.
There's this plugin that will attempt to extract content, title, name, author, keywords, date, content_type, content_length, and language.
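On the idea of breaking the PDF down before indexing: a rough sketch of one way to do it is to index one document per page (or per section), so each search hit carries its own location. The example below assumes the text has already been extracted page by page (for example with Tika) into a list of strings. The function name `build_bulk_payload`, the index name `pdf_pages`, and the field names `doc_id`, `page`, and `content` are made up for illustration. It only builds the newline-delimited JSON body that the Elasticsearch bulk endpoint expects and does not send anything.

```python
import json


def build_bulk_payload(pages, doc_id, index_name="pdf_pages"):
    """Build an NDJSON bulk body with one document per non-empty page.

    pages      -- list of page texts, in order (assumed already extracted)
    doc_id     -- identifier of the source PDF (hypothetical naming scheme)
    index_name -- target index (hypothetical)
    """
    lines = []
    for page_number, text in enumerate(pages, start=1):
        text = text.strip()
        if not text:
            continue  # skip blank pages, but keep the original page numbering
        # Action line: tells the bulk API where to index the next line.
        action = {"index": {"_index": index_name, "_id": f"{doc_id}-{page_number}"}}
        # Source line: the page text plus enough metadata to locate it again.
        source = {"doc_id": doc_id, "page": page_number, "content": text}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # Every line of a bulk body, including the last, ends with a newline.
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    payload = build_bulk_payload(["Intro text", "  ", "Chapter 1"], "manual")
    print(payload, end="")
```

A query against the `content` field would then return hits whose `doc_id` and `page` fields say where in the PDF the match is. The same shape works for sections instead of pages if you can detect section boundaries in the extracted text.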