Fscrawler/Elasticsearch page by page indexing

It is simpler than that. Just use Tika's ToXMLContentHandler to get the output as an XML string, then run a SAX parser (or Jsoup, in case the tags turn out not to be well-formed :D) against that XML and pull out the content page by page. No need to send anything back to Tika.

I can demo it for you pretty easily...
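Roughly, the SAX side could look like the sketch below. It is a minimal example under a couple of assumptions: that the XHTML wraps each page in `<div class="page">` (which is what I'd expect from Tika's PDF parser, but check your own output), and that the markup is well-formed. `PageSplitter` and `splitPages` are just names made up for illustration; in real use the input string would come from calling `toString()` on the ToXMLContentHandler after the parse.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: splits Tika-style XHTML into one text string per page.
class PageSplitter {

    /** Returns the text of each <div class="page"> element, in document order. */
    static List<String> splitPages(String xhtml) throws Exception {
        List<String> pages = new ArrayList<>();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xhtml)), new DefaultHandler() {
            private StringBuilder current; // null while we are outside a page div
            private int divDepth;          // div nesting depth inside the current page

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (!"div".equals(qName)) {
                    return;
                }
                if (current == null && "page".equals(atts.getValue("class"))) {
                    // Assumption: a page starts at <div class="page">.
                    current = new StringBuilder();
                    divDepth = 1;
                } else if (current != null) {
                    divDepth++; // nested div inside a page
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (current != null) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (current == null) {
                    return;
                }
                if ("p".equals(qName)) {
                    current.append('\n'); // keep paragraphs on separate lines
                }
                if ("div".equals(qName) && --divDepth == 0) {
                    pages.add(current.toString().trim());
                    current = null; // page finished
                }
            }
        });
        return pages;
    }

    public static void main(String[] args) throws Exception {
        // Hand-written sample standing in for the ToXMLContentHandler output.
        String xhtml = "<html><body>"
                + "<div class=\"page\"><p>First page text</p></div>"
                + "<div class=\"page\"><p>Second page text</p></div>"
                + "</body></html>";
        List<String> pages = splitPages(xhtml);
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("page " + (i + 1) + ": " + pages.get(i));
        }
    }
}
```

Each entry in the returned list can then be indexed as its own Elasticsearch document, with the list position plus one as the page number. If the XML is not well-formed, the SAX parser will throw, which is where falling back to Jsoup would come in.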
