Parsing and indexing documents with Apache Tika


(Jean Wisser) #1

Hello everyone,

I'm trying to parse and index .doc files into Elasticsearch with Apache Tika. My project is to build a resume search engine for my company.

Since we have a standardized resume format, I would like to parse these resumes using Apache Tika in Java.

Basically I have a .doc file like this:

   Jean Wisser                                           avenue des Ternes
                                                          75017 Paris
   Business Intelligence Consultant

   Skills : Qlikview, SAS, Cognos, ...
   Companies : IBM, Orange, ...

And I would like to extract and parse the content to index it in Elasticsearch like this:

 XContentBuilder builder = jsonBuilder()
     .startObject()
         .field("Name", "Jean")
         .field("Lastname", "Wisser")
         .startObject("Address")
             .field("Street", "avenue des Ternes")
             .field("City", "Paris")
             ......
         .endObject()
     .endObject();
  • What is the best way to achieve this?
  • Should I use Tika, POI, or something else?

(Matt Weber) #2

You are looking for the Mapper Attachments Plugin [1]. With it you just configure the attachment type in your mapping and send the base64-encoded .doc file to Elasticsearch.

[1] https://github.com/elastic/elasticsearch-mapper-attachments
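
For reference, a minimal mapping using the plugin's attachment type might look like the following sketch; the type name "resume" and field name "my_attachment" are placeholders, not anything prescribed by the plugin:

```json
{
  "resume": {
    "properties": {
      "my_attachment": {
        "type": "attachment"
      }
    }
  }
}
```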


(Jean Wisser) #3

The thing is, with the mapper attachments plugin I can only extract the content of my files as plain text.

I would like to parse every .doc into a specific Elasticsearch document that matches my .doc structure.

I don't want this (which is what I currently get with the plugin):

{
    "my_attachment" : {
        "_content_type" : "application/msword",
        "_name" : "resource/name/of/my.doc",
        "_language" : "en",
        "_content" : "... base64 encoded attachment ..."
    }
}

I would like to have:

    {
        "my_attachment" : {
            "_content_type" : "application/msword",
            "_name" : "resource/name/of/my.doc",
            "_language" : "en",
            "_address": {
                "street": "99 avenue des ternes",
                "city": "Paris"
            }
            ......
        }
    }

(Matt Weber) #4

Yes, the plugin only handles metadata and text content. If you have business logic that defines how to extract fields from the content, then you need to apply it before submitting to Elasticsearch. If you really had to do it in Elasticsearch, you could write a transform script [1] to extract the data you want and put it into fields, but even then the source (i.e. the JSON) would not be updated.

My recommendation would be to use Logstash with some custom filters: one that uses Tika to extract the content, another that applies your business logic to extract fields from the content, and finally the elasticsearch output to submit to Elasticsearch.

[1] http://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-transform.html


(Jean Wisser) #5

Thanks a lot for your answer Matt.

Currently I'm doing everything with the Java API. I managed to extract the content of my files with Tika, but I'm stuck at the second step, where I need to extract my fields. Do you have any idea how to get this done?

Edit: my idea for this part is to convert the content of the .doc to XHTML with Tika and then use something like XPath to parse out the fields.
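
A minimal sketch of that idea, using only the JDK's XML APIs. The XHTML string is hand-written to stand in for Tika's XHTML output for the resume above; the real markup (and therefore the XPath expressions) depends on how Tika renders the actual .doc:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XhtmlFieldExtractor {

    // Evaluates an XPath expression against an XHTML string and returns the text result.
    static String extract(String xhtml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical XHTML, standing in for Tika's output for the resume above.
        String xhtml = "<html><body>"
                + "<p>Jean Wisser</p>"
                + "<p>Business Intelligence Consultant</p>"
                + "<p>Skills : Qlikview, SAS, Cognos</p>"
                + "</body></html>";

        // With a standardized format, fields can be addressed by position.
        System.out.println(extract(xhtml, "/html/body/p[1]")); // Jean Wisser
        System.out.println(extract(xhtml, "/html/body/p[3]")); // Skills : Qlikview, SAS, Cognos
    }
}
```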

Also, since I will be extracting my fields rather than sending plain text, do I still need to encode anything in base64?

Thanks !


(Matt Weber) #6

That's the tricky part; I assumed you already had a way to do that. You will probably want to start with something simple like regular expressions, then do some research into entity extraction. Good luck.
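
As a starting point, the regular-expression route might look like this. The "Skills"/"Companies" labels follow the resume layout shown earlier in the thread, so the pattern is an assumption about that standardized format:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResumeRegex {

    // Captures the text after a "Label :" line; the label names follow the
    // standardized resume shown above and would need adjusting for other layouts.
    static String extractField(String content, String label) {
        Pattern p = Pattern.compile(label + "\\s*:\\s*(.+)");
        Matcher m = p.matcher(content);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String content = "Jean Wisser\n"
                + "Business Intelligence Consultant\n"
                + "Skills : Qlikview, SAS, Cognos\n"
                + "Companies : IBM, Orange";

        System.out.println(extractField(content, "Skills"));    // Qlikview, SAS, Cognos
        System.out.println(extractField(content, "Companies")); // IBM, Orange
    }
}
```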


(Cody) #7

Hi Jean,

I am having exactly the same problem. The only difference is that I am trying to index more data formats (pdf, doc, ppt...). May I ask what you ended up doing?

Thanks,
Cody


(Jean Wisser) #8

Hi Cody,

I ended up using Talend with Apache Tika to extract the content of my documents, and parsed my fields with Java functions and regex.


(Cody) #9

Jean, thanks for your reply. So, you are not using ES, including the mapper plugin, at all? I was also wondering if ES is a good option for primary data storage.

Thanks,
Cody


(Jean Wisser) #10

I used ES, but without the mapper plugin. My process was:
Documents -> Talend (Apache Tika + regex) -> MongoDB -> ES (with the MongoDB river).

I think you should consider something else (like MongoDB) as primary storage and plug Elasticsearch into it.


(Cody) #11

Thanks a lot. I am looking at Cassandra and trying to put it together with ES for the search workload.


(system) #12