Indexing PDFs and Performing Text Analytics with ES


(Rahul Nama) #1

Hello All

I've indexed a PDF into ES. Now I want to try out text analytics on the same PDF.

Here is a sample PDF after indexing, as seen in Kibana.

Now I want to run Elastic's cool analyzers (something NLP-related, like removing stop words, etc.) on the indexed text. How is this possible?

What are the additional steps I need to follow? Please let me know.

Thanks for your time, as always :slight_smile:


(Mark Harwood) #2

I may be misunderstanding the question, but removing stop words is something that happens during indexing, not after. What you're looking at in Kibana is the original JSON source form, not the indexed tokens that the analysis process created. Some docs that may help: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html
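Those analysis docs include the `_analyze` API, which is handy for seeing the tokens an analyzer would actually produce. A minimal sketch of building such a request (the `english` analyzer is a built-in; the host and index are assumptions, not from the thread):

```python
# Sketch of a request body for Elasticsearch's _analyze API, which shows the
# tokens an analysis chain produces (stop-word removal, stemming, etc.).
import json

analyze_request = {
    "analyzer": "english",  # built-in analyzer: lowercasing, stemming, stop words
    "text": "The quick brown foxes jumped over the lazy dogs",
}

# POST this body to e.g. http://localhost:9200/_analyze (via curl or the
# Python client) and the response lists the tokens that would be indexed.
print(json.dumps(analyze_request, indent=2))
```

Posting it against a running cluster returns the token stream, which is the easiest way to experiment with analyzer choices before committing to a mapping.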


(Rahul Nama) #3

Hey @Mark_Harwood

Yeah, you are right.
So should I use Logstash here for applying tokenizers (and/or analyzers), or
can this be done via the Python Elasticsearch client?


(Mark Harwood) #4

The choice of tokenizers and/or analyzers is defined in the index mapping.
The docs I linked to show the REST APIs that help you test and define mappings.
You can use any choice of client (curl, Perl, Python, Logstash, Ruby, ...) to talk to the Elasticsearch REST API.

If you create indexes every day with similar mappings, you can define those mappings once in a template that automatically applies to any new index that matches the template's chosen index naming pattern.
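For illustration, a mapping that picks an analyzer for the extracted text might look roughly like this (the index name and the `content`/`filename` field names are assumptions, not from the thread):

```python
# Sketch of an index mapping: "content" is analyzed text for full-text search,
# "filename" is a keyword field kept as a single un-analyzed token.
import json

mapping = {
    "mappings": {
        "properties": {
            "content":  {"type": "text", "analyzer": "english"},
            "filename": {"type": "keyword"},
        }
    }
}

# PUT this body to e.g. http://localhost:9200/pdf-index to create the index;
# every "content" value is then tokenized by the english analyzer at index time.
print(json.dumps(mapping, indent=2))
```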


(Rahul Nama) #5

Thanks @Mark_Harwood :slight_smile:

So while indexing the PDFs, I need to define the mappings with analyzers to make the text compatible for further processing (calling via the API after indexing, for further analysis with a third-party library).
In addition, I should also store the pages of the PDF, since I need them for full-text search.
This can be achieved with a single mapping, right?


(Mark Harwood) #6

If we ignore the more advanced topic of templates, the process is as follows:

  1. Define the mapping for your new index using the put mapping API.
  2. Parse the PDF to extract plain text.
  3. Present the plain text strings in a JSON document to Elasticsearch's indexing API.
  4. Search using the search API.

Step 2 is typically done outside of Elasticsearch, e.g. using the Tika framework for parsing docs.
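The steps above can be sketched roughly as follows; to keep the snippet self-contained the Tika call is replaced by a hard-coded string, and the field names and index are assumptions:

```python
# Rough sketch of steps 2-3: extract plain text, wrap it in JSON, index it.
import json

# Step 2 (normally something like: tika.parser.from_file("report.pdf")["content"])
plain_text = "Quarterly results improved across all regions."

# Step 3: present the plain text as a field in a JSON document
doc = {"filename": "report.pdf", "content": plain_text}

# POST this body to e.g. http://localhost:9200/pdf-index/_doc to index it;
# step 4 is then an ordinary search query against the "content" field.
print(json.dumps(doc))
```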


(Rahul Nama) #7

That cleared many of my doubts; you kept it simple :slight_smile: @Mark_Harwood

I still have a small confusion about step 3:

Do the plain text strings mean each page of the PDF?


(Mark Harwood) #8

Not necessarily. If the PDFs are insanely large then maybe you'd have to resort to separate docs, but I'd expect a single string for the text, and maybe separate strings for any structured fields Tika might give you, e.g. filename, author.


(Rahul Nama) #9

@Mark_Harwood

Thanks much :slight_smile:

Since I need to perform both full-text search and text transformation/analytics, in a document I need to store both the tokenized/analyzed text and the content of the PDF/page, to retrieve while searching.

Isn't that right? Or am I missing something here?


(Mark Harwood) #10

There are 3 representations of the doc:

  1. The original binary PDF file - markup/font choices and all.
  2. The plain text string held as a field in a JSON doc
  3. The individual words of the text stored as tokens inside a search index.

Your parsing app uses a tool like Tika to make 2) from 1). It sends 2) to Elasticsearch.
Elasticsearch stores 2) and uses a choice of analyzer from the index mapping to create and store 3).

Elasticsearch doesn't do anything with 1) - or other 1)-like document formats such as Word, Excel, PowerPoint, etc. It works only with 2), i.e. JSON.
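As a rough illustration only (this is not Elasticsearch's actual code), turning 2) into 3) with a standard-analyzer-style chain amounts to lowercasing the string and splitting it into word tokens:

```python
# Toy approximation of how an analyzer turns the stored plain text field (2)
# into the tokens held in the search index (3).
plain_text = "The Quick Brown Fox"     # 2) the plain text field in the JSON doc
tokens = plain_text.lower().split()    # 3) lowercased word tokens for the index
print(tokens)  # ['the', 'quick', 'brown', 'fox']
```

A real analyzer does more (configurable tokenization, stop-word removal, stemming), but the point is the same: 3) is derived from 2) at index time according to the mapping.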


(Rahul Nama) #11

@Mark_Harwood

Yes, it's much clearer now :slight_smile:

That means the JSON produced in step 2 should be given as input to step 3, right?

Thanks so much for your time. :slight_smile:


(Rahul Nama) #12

Hey @Mark_Harwood, I need your suggestions here, please help.

As discussed, I'm trying to index PDFs.

I'm confused between the two methods below:

  1. Convert the PDFs to text using Tika and then index the text into Elasticsearch, or
  2. Use the ingest attachment processor to index the PDFs directly.

I have 40-50 PDFs, each with around 8-10 pages. What is the best way to index them?

Please suggest
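For reference, option 2 means defining an ingest pipeline with the attachment processor (built on Tika inside Elasticsearch, and needing the ingest-attachment plugin on older versions); a sketch of such a pipeline, with the pipeline and field names being assumptions:

```python
# Sketch of an ingest pipeline using the attachment processor: documents are
# sent with a base64-encoded PDF in "data", and the processor extracts the
# text into attachment.content at ingest time.
import json

pipeline = {
    "description": "Extract text from base64-encoded PDFs",
    "processors": [
        {"attachment": {"field": "data"}}
    ],
}

# PUT this body to e.g. http://localhost:9200/_ingest/pipeline/pdf-pipeline,
# then index docs like {"data": "<base64 pdf>"} with ?pipeline=pdf-pipeline.
print(json.dumps(pipeline, indent=2))
```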


(system) #13

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.