My question is: can FSCrawler index PDFs page by page directly? For example, can FSCrawler build the parent/child relationship while indexing the documents so that a search query in Elasticsearch returns the matching page rather than the full document?
If you're rolling your own, you can get the XHTML output from Tika. We mark page breaks in PDFs with <div class="page"> elements, so you should be able to parse out the contents per page.
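For reference, the XHTML that Tika emits for a two-page PDF looks roughly like this (trimmed sketch; the exact metadata in <head> varies by Tika version):

```xml
<html xmlns="http://www.w3.org/1999/xhtml">
<head>...</head>
<body>
  <div class="page"><p>Text of page 1...</p></div>
  <div class="page"><p>Text of page 2...</p></div>
</body>
</html>
```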
Super interesting @tallison. Do you think I could ask Tika for the XHTML output, read it, and then for each <div class="page"></div> I find, send that content back to Tika to extract the page's text?
It is simpler than that. Just use the ToXMLContentHandler to get an XML string, then run a SAXParser (or JSoup, in case we're not getting our tags right :D) against that XML and parse the content per page. No need to send anything back to Tika.
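A minimal sketch of that approach in Java, parsing the PDF once and splitting the XHTML per page with JSoup ("document.pdf" is a placeholder path; you could use a SAXParser instead if you want strict XML parsing):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToXMLContentHandler;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PerPageExtractor {
    public static void main(String[] args) throws Exception {
        // Parse the PDF once; ToXMLContentHandler captures the full XHTML output.
        ToXMLContentHandler handler = new ToXMLContentHandler();
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get("document.pdf"))) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }

        // Tika's PDF parser wraps each page in <div class="page">, so we can
        // split the XHTML by those divs. No second pass through Tika needed.
        Document doc = Jsoup.parse(handler.toString());
        int pageNumber = 1;
        for (Element page : doc.select("div.page")) {
            System.out.println("--- page " + pageNumber++ + " ---");
            System.out.println(page.text());
        }
    }
}
```

From there, each page's text could be sent to Elasticsearch as its own child document (e.g., with a page number field), so searches can return individual pages.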