FSCrawler/Elasticsearch page-by-page indexing

Hello,

My question is: can FSCrawler index PDFs page by page directly? For example, can FSCrawler create a parent/child relationship while indexing the documents, so that a search query in Elasticsearch returns the matching page rather than the full document?

Thanks in advance for your help.

G.

Welcome.

No, it cannot. AFAIK, Tika sadly does not support page-by-page extraction.

If you're rolling your own, you can get the XHTML output from Tika, and we do mark page breaks in PDFs as <div class="page"></div>, so you should be able to parse out the contents per page.
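Roughly, getting that XHTML out looks like this (a minimal sketch, assuming a recent Tika and a local doc.pdf as a placeholder; this is not FSCrawler code):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.ToXMLContentHandler;

public class XhtmlDump {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        // ToXMLContentHandler serializes the SAX events to XHTML,
        // keeping the <div class="page"> markers at each PDF page break.
        ToXMLContentHandler handler = new ToXMLContentHandler();
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get("doc.pdf"))) {
            parser.parse(stream, handler, metadata);
        }
        System.out.println(handler.toString());
    }
}
```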

Super interesting, @tallison. Do you think I could then ask Tika for XHTML output, read it, and for each <div class="page"></div> I find, send that content back to Tika to extract the page's content?

That could be a great addition to FSCrawler...

It is simpler than that. Just use the ToXMLContentHandler to get an XML string, then run a SAXParser (or JSoup, in case we're not getting our tags right :D) against that XML and parse the content per page. No need to send anything back to Tika.

I can demo it for you pretty easily...
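Something like this, say (a rough sketch using JSoup; PageSplitter and splitPages are made-up names for illustration, not FSCrawler or Tika API):

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class PageSplitter {
    /** Splits Tika's XHTML output into one text string per PDF page. */
    public static List<String> splitPages(String xhtml) {
        List<String> pages = new ArrayList<>();
        // JSoup is lenient about imperfect markup, which is the point
        // of reaching for it over a strict SAXParser.
        for (Element page : Jsoup.parse(xhtml).select("div.page")) {
            pages.add(page.text());
        }
        return pages;
    }
}
```

Each entry could then be indexed as its own Elasticsearch document, e.g. as a child of the PDF's parent document, so a query match points at a page rather than the whole file.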

And while you're working with parent/child documents, can I interest you in the RecursiveParserWrapper? See, e.g., https://issues.apache.org/jira/browse/SOLR-7229 :grin:
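For the curious, basic usage looks something like this (a sketch assuming Tika 1.17+, where RecursiveParserWrapperHandler lives in org.apache.tika.sax; doc.pdf is again a placeholder):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.apache.tika.sax.RecursiveParserWrapperHandler;

public class RecursiveDump {
    public static void main(String[] args) throws Exception {
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(new AutoDetectParser());
        // One handler per (embedded) document; -1 means no write limit.
        RecursiveParserWrapperHandler handler = new RecursiveParserWrapperHandler(
                new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.XML, -1));
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get("doc.pdf"))) {
            wrapper.parse(stream, handler, metadata, new ParseContext());
        }
        // One Metadata per document: the container first, then each attachment,
        // with the extracted content stored under the "X-TIKA:content" key.
        for (Metadata m : handler.getMetadataList()) {
            System.out.println(m.get("resourceName"));
            System.out.println(m.get("X-TIKA:content"));
        }
    }
}
```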
