How can I extract clear text of an attachment file (pdf)

Marria · March 3, 2015, 12:40pm

Hi Charlie,

Thank you very much for your help.

I used only the extracted text, as described in the link you gave me.

I used the Python client to extract the text:

import elasticsearch
import csv
import random
import unicodedata

#replace with the IP address of your Elasticsearch node
es = elasticsearch.Elasticsearch(["127.0.0.1:9200"])

# Replace the following query with your own Elasticsearch query

res = es.search(index="fichier", body={
    "fields": [
        "file"
    ]
}, size=10)
random.seed(1)
sample = res['hits']['hits']
# comment out the previous line, and un-comment the next line, for a random sample instead
# (change the int to randomly sample that number of rows from your query)
#randomsample = random.sample(res['hits']['hits'], 5)

print("Got %d Hits:" % res['hits']['total'])

# set name of output file here ('wb' is the Python 2 mode; on Python 3 use open('mytest.tsv', 'w', newline=''))
with open('mytest.tsv', 'wb') as csvfile:
    # we use TAB delimited, to handle cases where freeform text may have a comma
    filewriter = csv.writer(csvfile, delimiter='\t',
                            quotechar='|', quoting=csv.QUOTE_MINIMAL)

    # create header row
    filewriter.writerow(["id", "fields"])  # change the column labels here

    # switch sample to randomsample if you want a random subset, instead of all rows
    for hit in sample:
        # try/except used to handle unstructured data, in cases where a field
        # may not exist for a given hit
        try:
            col1 = hit["_id"]
        except Exception:
            col1 = ""
        try:
            col2 = hit["fields"]
            # col2 = col2.replace('\n', ' ')
        except Exception:
            col2 = ""
        filewriter.writerow([col1, col2])

And, it works! I get all the text from the file.
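
For anyone reusing this: the script above writes hit["fields"] to the file as-is. Below is a rough sketch of how each hit could be flattened into a single tab-separated row instead. It assumes Python 3 and assumes that hit["fields"] is a dict mapping each requested field name to a list of values; check that against your own response. The helper names hit_to_row and write_tsv and the sample hits are made up for illustration, they are not part of the Elasticsearch client.

```python
import csv
import io


def hit_to_row(hit, field="file"):
    """Return [id, text] for one search hit, with the text on a single line."""
    doc_id = hit.get("_id", "")
    # Assumed shape: hit["fields"] maps each field name to a list of values.
    values = hit.get("fields", {}).get(field, [])
    if isinstance(values, str):
        values = [values]
    text = " ".join(str(v) for v in values)
    # Collapse newlines, tabs and repeated spaces so each hit stays on one TSV row.
    return [doc_id, " ".join(text.split())]


def write_tsv(hits, out, field="file"):
    """Write a header row plus one row per hit to an open text-mode file object."""
    writer = csv.writer(out, delimiter="\t", quotechar="|",
                        quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(["id", field])
    for hit in hits:
        writer.writerow(hit_to_row(hit, field))


# Made-up hits for illustration; real ones would come from res['hits']['hits'].
sample_hits = [
    {"_id": "1", "fields": {"file": ["First line\nsecond line"]}},
    {"_id": "2"},  # a hit without the requested field
]
buf = io.StringIO()
write_tsv(sample_hits, buf)
tsv_output = buf.getvalue()
```

To write to a real file on Python 3, pass open('mytest.tsv', 'w', newline='') instead of the StringIO buffer. A hit that lacks the field simply gets an empty second column, so no try/except is needed.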

Really, Charlie, thank you.


