I'm trying to parse .doc files with Apache Tika and index them into Elasticsearch.
My project is to build a resume search engine for my company.
Since we have a standardized resume format, I would like to parse these resumes using Apache Tika in Java.
Basically, I have a .doc file like this:
    Jean Wisser
    avenue des Ternes
    75017 Paris
    Business Intelligence Consultant
    Skills : Qlikview, SAS, Cognos, ...
    Companies : IBM, Orange, ...
I would like to extract the content and split it into fields so that I can index it in Elasticsearch like this:
    XContentBuilder builder = jsonBuilder()
        .startObject()
            .field("Name", "Jean")
            .field("Lastname", "Wisser")
            .startObject("Adress")
                .field("Street", "avenue des Ternes")
                .field("City", "Paris")
                ......
            .endObject()
        .endObject()
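To make the goal more concrete, here is a minimal sketch of the mapping step I have in mind. It assumes the plain text has already been extracted (for example with Tika's `parseToString`) and that every resume follows exactly the same layout: name, street, zip code and city, job title, then `Label : value, value` lines. The `ResumeParser` class and the `Zip` and `Title` keys are hypothetical names of mine, and the other keys are spelled as in the snippet above. The resulting nested map mirrors the structure I build with `XContentBuilder`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns the plain text of one standardized resume
// into a nested map with the same shape as the JSON document above.
class ResumeParser {

    static Map<String, Object> parse(String text) {
        // Keep only the non-blank lines, trimmed.
        List<String> lines = new ArrayList<>();
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        if (lines.size() < 4) {
            throw new IllegalArgumentException("Expected at least 4 non-blank lines");
        }

        Map<String, Object> doc = new LinkedHashMap<>();

        // Line 1 (assumed): "<first name> <last name>", split on the first whitespace.
        String[] name = lines.get(0).split("\\s+", 2);
        doc.put("Name", name[0]);
        doc.put("Lastname", name.length > 1 ? name[1] : "");

        // Lines 2 and 3 (assumed): street, then "<zip code> <city>".
        Map<String, Object> address = new LinkedHashMap<>();
        address.put("Street", lines.get(1));
        String[] zipCity = lines.get(2).split("\\s+", 2);
        address.put("Zip", zipCity[0]);
        address.put("City", zipCity.length > 1 ? zipCity[1] : "");
        doc.put("Adress", address); // key spelled as in the snippet above

        // Line 4 (assumed): job title.
        doc.put("Title", lines.get(3));

        // Remaining lines (assumed): "Label : value1, value2, ...".
        for (String line : lines.subList(4, lines.size())) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "Label : values" line, skip it
            }
            String label = line.substring(0, colon).trim();
            List<String> values = new ArrayList<>();
            for (String value : line.substring(colon + 1).split(",")) {
                if (!value.trim().isEmpty()) {
                    values.add(value.trim());
                }
            }
            doc.put(label, values);
        }
        return doc;
    }

    public static void main(String[] args) {
        String text = String.join("\n",
                "Jean Wisser",
                "avenue des Ternes",
                "75017 Paris",
                "Business Intelligence Consultant",
                "Skills : Qlikview, SAS, Cognos",
                "Companies : IBM, Orange");
        System.out.println(parse(text));
    }
}
```

This only works as long as the extracted text really comes back one field per line, which I have not verified for .doc files, so a positional approach like this may be too fragile.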
- What is the best way to achieve this?
- Should I use Tika, POI, or something else?