Parsing and indexing documents with Apache Tika


(Jean Wisser) #1

Hello everyone,

I'm trying to parse and index .doc files into Elasticsearch with Apache Tika. My project is to build a resume search engine for my company.

Since we have a standardized resume format, I would like to parse these resumes using Apache Tika in Java.

Basically I have a .doc file like this:

   Jean Wisser                                           avenue des Ternes
                                                          75017 Paris
   Business Intelligence Consultant

   Skills : Qlikview, SAS, Cognos, ...
   Companies : IBM, Orange, ...

And I would like to extract and parse the content to index it in Elasticsearch like this:

 XContentBuilder builder = jsonBuilder()
     .startObject()
         .field("Name", "Jean")
         .field("Lastname", "Wisser")
         .startObject("Address")
             .field("Street", "avenue des Ternes")
             .field("City", "Paris")
             ......
         .endObject()
     .endObject();
  • What is the best way to achieve this?
  • Should I use Tika, POI, or something else?

(Matt Weber) #2

You are looking for the Mapper Attachments Plugin [1]. With it you just configure the attachment type in your mapping and send the base64-encoded .doc file to Elasticsearch.

[1] https://github.com/elastic/elasticsearch-mapper-attachments
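
For reference, a minimal mapping using the plugin's attachment type might look like the following sketch; the type name "resume" and field name "my_attachment" are placeholders, not anything prescribed by the plugin:

```json
{
  "resume": {
    "properties": {
      "my_attachment": {
        "type": "attachment"
      }
    }
  }
}
```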


(Jean Wisser) #3

The thing is, with the mapper attachments plugin I can only extract the content of my files as plain text.

I would like to parse every .doc into a specific Elasticsearch document that matches my .doc structure.

I don't want this (which is what I currently get with the plugin):

{
    "my_attachment" : {
        "_content_type" : "application/msword",
        "_name" : "resource/name/of/my.doc",
        "_language" : "en",
        "_content" : "... base64 encoded attachment ..."
    }
}

I would like to have:

    {
        "my_attachment" : {
            "_content_type" : "application/msword",
            "_name" : "resource/name/of/my.doc",
            "_language" : "en",
            "_address": {
                "street": "99 avenue des ternes",
                "city": "Paris"
            }
            ......
        }
    }

(Matt Weber) #4

Yes, the plugin only handles metadata and text content. If you have business logic that defines how to extract fields from the content, then you need to apply it before submitting to Elasticsearch. If you really had to do it in Elasticsearch, you could write a transform script [1] to extract the data you want and put it into fields, but even then the source (i.e. the JSON) would not be updated.

My recommendation would be to use Logstash with some custom filters: one that uses Tika to extract the content, another that applies your business logic to extract fields from the content, and finally the elasticsearch output to submit to Elasticsearch.

[1] http://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-transform.html


(Jean Wisser) #5

Thanks a lot for your answer Matt.

Currently I'm doing everything with the Java API. I managed to extract the content of my files with Tika, but I'm stuck at the second step, where I need to extract my fields. Do you have any idea how to get this done?

Edit: my idea for this part is to convert the content of the .doc to XHTML with Tika and then use something like XPath to parse out the fields.
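
A minimal sketch of that idea, using only the JDK's XML APIs. The XHTML string is hand-written to stand in for Tika's XHTML output for the resume above; the real markup (and therefore the XPath expressions) depends on how Tika renders the actual .doc:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XhtmlFieldExtractor {

    // Evaluates an XPath expression against an XHTML string and returns the text result.
    static String extract(String xhtml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical XHTML, standing in for Tika's output for the resume above.
        String xhtml = "<html><body>"
                + "<p>Jean Wisser</p>"
                + "<p>Business Intelligence Consultant</p>"
                + "<p>Skills : Qlikview, SAS, Cognos</p>"
                + "</body></html>";

        // With a standardized format, fields can be addressed by position.
        System.out.println(extract(xhtml, "/html/body/p[1]")); // Jean Wisser
        System.out.println(extract(xhtml, "/html/body/p[3]")); // Skills : Qlikview, SAS, Cognos
    }
}
```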

Also, since I will be extracting my fields rather than sending plain text, do I still need to encode anything in base64?

Thanks !


(Matt Weber) #6

That's the tricky part; I assumed you already had a way to do that. You will probably want to start with something simple like regular expressions, then do some research into entity extraction. Good luck.
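
As a starting point, the regular-expression route might look like this. The "Skills"/"Companies" labels follow the resume layout shown earlier in the thread, so the pattern is an assumption about that standardized format:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResumeRegex {

    // Captures the text after a "Label :" line; the label names follow the
    // standardized resume shown above and would need adjusting for other layouts.
    static String extractField(String content, String label) {
        Pattern p = Pattern.compile(label + "\\s*:\\s*(.+)");
        Matcher m = p.matcher(content);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String content = "Jean Wisser\n"
                + "Business Intelligence Consultant\n"
                + "Skills : Qlikview, SAS, Cognos\n"
                + "Companies : IBM, Orange";

        System.out.println(extractField(content, "Skills"));    // Qlikview, SAS, Cognos
        System.out.println(extractField(content, "Companies")); // IBM, Orange
    }
}
```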


(Cody) #7

Hi Jean,

I am having exactly the same problem. The only difference is that I am trying to index more data formats (pdf, doc, ppt...). May I ask what you ended up doing?

Thanks,
Cody


(Jean Wisser) #8

Hi Cody,

I ended up using Talend with Apache Tika to extract the content of my documents, and parsed my fields with Java functions and regex.


(Cody) #9

Jean, thanks for your reply. So, you are not using ES, including the mapper plugin, at all? I was also wondering if ES is a good option for primary data storage.

Thanks,
Cody


(Jean Wisser) #10

I used ES, but without the mapper plugin. My process was:
Documents -> Talend (Apache Tika + regex) -> MongoDB -> ES (with the MongoDB river).

I think you should consider something else (like MongoDB) as primary storage and plug Elasticsearch into it.


(Cody) #11

Thanks a lot. I am looking at Cassandra and trying to put it together with ES for the search workload.


(system) #12