FSCrawler/Elasticsearch page-by-page indexing

Hello,

My question is: can FSCrawler index PDFs page by page directly? For example, can FSCrawler create a parent/child relationship while indexing the documents, so that a search query in Elasticsearch returns the matching page rather than the full document?

Thanks in advance for your help.

G.

Welcome.

No, it cannot. AFAIK, Tika sadly does not support page-by-page extraction.

If you're rolling your own, you can get the XHTML output from Tika, and we do mark page breaks in PDFs as <div class="page"></div>, so you should be able to parse out the contents per page.
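Roughly, getting that XHTML out looks like this (a minimal sketch, assuming a recent Tika and a local doc.pdf as a placeholder; this is not FSCrawler code):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.ToXMLContentHandler;

public class XhtmlDump {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        // ToXMLContentHandler serializes the SAX events to XHTML,
        // keeping the <div class="page"> markers at each PDF page break.
        ToXMLContentHandler handler = new ToXMLContentHandler();
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get("doc.pdf"))) {
            parser.parse(stream, handler, metadata);
        }
        System.out.println(handler.toString());
    }
}
```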

Super interesting, @tallison. Do you think I could then ask Tika for XHTML output, read it, and for each <div class="page"></div> I find, send that content back to Tika to extract the page's content?

That could be a great addition to FSCrawler...

It is simpler than that. Just use the ToXMLContentHandler to get an XML string, then run a SAXParser (or JSoup, in case we're not getting our tags right :D) against that XML and parse the content per page. No need to send anything back to Tika.

I can demo it for you pretty easily...
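Something like this, say (a rough sketch using JSoup; PageSplitter and splitPages are made-up names for illustration, not FSCrawler or Tika API):

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class PageSplitter {
    /** Splits Tika's XHTML output into one text string per PDF page. */
    public static List<String> splitPages(String xhtml) {
        List<String> pages = new ArrayList<>();
        // JSoup is lenient about imperfect markup, which is the point
        // of reaching for it over a strict SAXParser.
        for (Element page : Jsoup.parse(xhtml).select("div.page")) {
            pages.add(page.text());
        }
        return pages;
    }
}
```

Each entry could then be indexed as its own Elasticsearch document, e.g. as a child of the PDF's parent document, so a query match points at a page rather than the whole file.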

And while you're working with parent/child documents, can I interest you in the RecursiveParserWrapper? See, e.g., https://issues.apache.org/jira/browse/SOLR-7229 :grin:
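For the curious, basic usage looks something like this (a sketch assuming Tika 1.17+, where RecursiveParserWrapperHandler lives in org.apache.tika.sax; doc.pdf is again a placeholder):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.apache.tika.sax.RecursiveParserWrapperHandler;

public class RecursiveDump {
    public static void main(String[] args) throws Exception {
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(new AutoDetectParser());
        // One handler per (embedded) document; -1 means no write limit.
        RecursiveParserWrapperHandler handler = new RecursiveParserWrapperHandler(
                new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.XML, -1));
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get("doc.pdf"))) {
            wrapper.parse(stream, handler, metadata, new ParseContext());
        }
        // One Metadata per document: the container first, then each attachment,
        // with the extracted content stored under the "X-TIKA:content" key.
        for (Metadata m : handler.getMetadataList()) {
            System.out.println(m.get("resourceName"));
            System.out.println(m.get("X-TIKA:content"));
        }
    }
}
```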
