Hello, my question is: can we index PDFs page by page with FSCrawler directly? For example, can FSCrawler set up a parent/child relationship while indexing the documents, so that a search query in Elasticsearch returns the matching page rather than the full document? Thanks for your precious help. G. …

Fscrawler/Elasticsearch page by page indexing

tallison (Tim Allison) June 28, 2019, 6:10pm 5

It is simpler than that. Just use the ToXMLContentHandler to get an XML string, then run a SAXParser (or JSoup, in case we're not getting our tags right :D) against that XML and parse the content per page. No need to send anything back to Tika.

I can demo it for you pretty easily...
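
Here is a rough sketch of the SAX half of that idea, using only the JDK. It assumes the XHTML string has already been produced by Tika's ToXMLContentHandler and that each PDF page is wrapped in a `<div class="page">` element; check your own output before relying on that. The names `PageSplitter`, `PageHandler` and `splitPages` are made up for this example and are not part of Tika or FSCrawler.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: splits Tika-style XHTML into one text string per page.
class PageSplitter {

    // Collects the text inside each <div class="page"> element (assumed page marker).
    static class PageHandler extends DefaultHandler {
        final List<String> pages = new ArrayList<>();
        private StringBuilder current; // non-null only while inside a page div
        private int depth;             // nesting depth of divs inside the current page

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            boolean isDiv = "div".equals(qName);
            if (current == null) {
                if (isDiv && "page".equals(atts.getValue("class"))) {
                    current = new StringBuilder();
                    depth = 1;
                }
            } else if (isDiv) {
                depth++; // nested div inside a page
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (current != null) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (current == null) {
                return;
            }
            if ("p".equals(qName)) {
                current.append('\n'); // keep paragraph breaks
            }
            if ("div".equals(qName) && --depth == 0) {
                pages.add(current.toString().trim());
                current = null;
            }
        }
    }

    // Parses the XHTML string and returns the text of each page, in order.
    static List<String> splitPages(String xhtml) {
        try {
            PageHandler handler = new PageHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.pages;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written stand-in for the XML string from ToXMLContentHandler.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<div class=\"page\"><p>First page text</p></div>"
                + "<div class=\"page\"><p>Second page text</p></div>"
                + "</body></html>";
        List<String> pages = splitPages(xhtml);
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("page " + (i + 1) + ": " + pages.get(i));
        }
    }
}
```

Each returned string could then be sent to Elasticsearch as its own document, for example with the page number and the source file name as extra fields, so that a search hit points at a page rather than the whole PDF.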


