Indexing PDFs and Performing Text Analytics with ES


(Rahul Nama) #1

Hello All

I've indexed a PDF into ES. Now I want to try out text analytics on the same PDF.

Here is a sample PDF after indexing, as seen in Kibana.

Now I want to run Elastic's cool analyzers (something NLP-related, like removing stop words, etc.) on the indexed text. How is this possible?

What are the additional steps I need to follow? Please let me know.

Thanks for your time, as always :slight_smile:


(Mark Harwood) #2

I may be misunderstanding the question, but removing stop words is something that happens during indexing, not after. What you're looking at in Kibana is the original JSON source form, not the indexed tokens that the analysis process created. Some docs that may help: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html
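Those analysis docs include the `_analyze` API, which is handy for seeing the tokens an analyzer would actually produce. A minimal sketch of building such a request (the `english` analyzer is a built-in; the host and index are assumptions, not from the thread):

```python
# Sketch of a request body for Elasticsearch's _analyze API, which shows the
# tokens an analysis chain produces (stop-word removal, stemming, etc.).
import json

analyze_request = {
    "analyzer": "english",  # built-in analyzer: lowercasing, stemming, stop words
    "text": "The quick brown foxes jumped over the lazy dogs",
}

# POST this body to e.g. http://localhost:9200/_analyze (via curl or the
# Python client) and the response lists the tokens that would be indexed.
print(json.dumps(analyze_request, indent=2))
```

Posting it against a running cluster returns the token stream, which is the easiest way to experiment with analyzer choices before committing to a mapping.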


(Rahul Nama) #3

Hey @Mark_Harwood

Yeah, you are right.
So should I use Logstash here for applying tokenizers (and/or analyzers), or
can this be done via the Python Elasticsearch client?


(Mark Harwood) #4

The choice of tokenizers and/or analyzers is defined in the index mapping.
The docs I linked to show the REST APIs that help you test and define mappings.
You can use any choice of client (curl, Perl, Python, Logstash, Ruby, ...) to talk to the Elasticsearch REST API.

If you create indexes every day with similar mappings, you can define those mappings once in a template that automatically applies to any new index that matches the template's chosen index naming pattern.
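For illustration, a mapping that picks an analyzer for the extracted text might look roughly like this (the index name and the `content`/`filename` field names are assumptions, not from the thread):

```python
# Sketch of an index mapping: "content" is analyzed text for full-text search,
# "filename" is a keyword field kept as a single un-analyzed token.
import json

mapping = {
    "mappings": {
        "properties": {
            "content":  {"type": "text", "analyzer": "english"},
            "filename": {"type": "keyword"},
        }
    }
}

# PUT this body to e.g. http://localhost:9200/pdf-index to create the index;
# every "content" value is then tokenized by the english analyzer at index time.
print(json.dumps(mapping, indent=2))
```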


(Rahul Nama) #5

Thanks @Mark_Harwood :slight_smile:

So while indexing the PDFs, I need to define the mappings with analyzers to make the text compatible for further processing (calling via the API after indexing, for further analysis with a third-party library).
In addition, I should also store the pages of the PDF, since I need them for full-text search.
This can be achieved with a single mapping, right?


(Mark Harwood) #6

If we ignore the more advanced topic of templates, the process is as follows:

  1. Define the mapping for your new index using the put mapping API.
  2. Parse the PDF to extract plain text.
  3. Present the plain text strings in a JSON document to Elasticsearch's indexing API.
  4. Search using the search API.

Step 2 is typically done outside of Elasticsearch, e.g. using the Tika framework for parsing docs.
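The steps above can be sketched roughly as follows; to keep the snippet self-contained the Tika call is replaced by a hard-coded string, and the field names and index are assumptions:

```python
# Rough sketch of steps 2-3: extract plain text, wrap it in JSON, index it.
import json

# Step 2 (normally something like: tika.parser.from_file("report.pdf")["content"])
plain_text = "Quarterly results improved across all regions."

# Step 3: present the plain text as a field in a JSON document
doc = {"filename": "report.pdf", "content": plain_text}

# POST this body to e.g. http://localhost:9200/pdf-index/_doc to index it;
# step 4 is then an ordinary search query against the "content" field.
print(json.dumps(doc))
```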


(Rahul Nama) #7

That cleared many of my doubts; you kept it simple :slight_smile: @Mark_Harwood

I still have a small confusion about step 3:

Do the plain text strings mean each page of the PDF?


(Mark Harwood) #8

Not necessarily. If the PDFs are insanely large then maybe you'd have to resort to separate docs, but I'd expect a single string for the text, and maybe separate strings for any structured fields Tika might give you, e.g. filename, author.


(Rahul Nama) #9

@Mark_Harwood

Thanks much :slight_smile:

Since I need to perform both full-text search and text transformation/analytics, in a document I need to store both the tokenized/analyzed text and the content of the PDF/page, to retrieve while searching.

Isn't that right? Or am I missing something here?


(Mark Harwood) #10

There are 3 representations of the doc:

  1. The original binary PDF file - markup/font choices and all.
  2. The plain text string held as a field in a JSON doc
  3. The individual words of the text stored as tokens inside a search index.

Your parsing app uses a tool like Tika to make 2) from 1). It sends 2) to Elasticsearch.
Elasticsearch stores 2) and uses a choice of analyzer from the index mapping to create and store 3).

Elasticsearch doesn't do anything with 1) - or other 1)-like document formats such as Word, Excel, PowerPoint, etc. It works only with 2), i.e. JSON.
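As a rough illustration only (this is not Elasticsearch's actual code), turning 2) into 3) with a standard-analyzer-style chain amounts to lowercasing the string and splitting it into word tokens:

```python
# Toy approximation of how an analyzer turns the stored plain text field (2)
# into the tokens held in the search index (3).
plain_text = "The Quick Brown Fox"     # 2) the plain text field in the JSON doc
tokens = plain_text.lower().split()    # 3) lowercased word tokens for the index
print(tokens)  # ['the', 'quick', 'brown', 'fox']
```

A real analyzer does more (configurable tokenization, stop-word removal, stemming), but the point is the same: 3) is derived from 2) at index time according to the mapping.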


(Rahul Nama) #11

@Mark_Harwood

Yes, it's much clearer now :slight_smile:

That means the JSON produced in step 2 should be given as input to step 3, right?

Thanks so much for your time. :slight_smile:


(Rahul Nama) #12

Hey @Mark_Harwood, I need your suggestions here, please help.

As discussed, I'm trying to index PDFs.

I'm confused between the two methods below:

  1. Convert the PDFs to text using Tika and then index the text into Elasticsearch, or
  2. Use the ingest attachment processor to index the PDFs directly.

I have 40-50 PDFs, each with around 8-10 pages. What is the best way to index them?

Please suggest
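For reference, option 2 means defining an ingest pipeline with the attachment processor (built on Tika inside Elasticsearch, and needing the ingest-attachment plugin on older versions); a sketch of such a pipeline, with the pipeline and field names being assumptions:

```python
# Sketch of an ingest pipeline using the attachment processor: documents are
# sent with a base64-encoded PDF in "data", and the processor extracts the
# text into attachment.content at ingest time.
import json

pipeline = {
    "description": "Extract text from base64-encoded PDFs",
    "processors": [
        {"attachment": {"field": "data"}}
    ],
}

# PUT this body to e.g. http://localhost:9200/_ingest/pipeline/pdf-pipeline,
# then index docs like {"data": "<base64 pdf>"} with ?pipeline=pdf-pipeline.
print(json.dumps(pipeline, indent=2))
```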


(system) #13

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.