How can I extract the clear text of an attachment file (PDF)?

Hi Charlie,

Thank you very much for your help.

I extracted only the text, as described in the link you gave me.

I used the Python client to extract the text:

import elasticsearch
import csv
import random
import unicodedata

#replace with the IP address of your Elasticsearch node
es = elasticsearch.Elasticsearch(["127.0.0.1:9200"])

#replace the following query with your own Elasticsearch query

res = es.search(index="fichier", body={
    "fields": [
        "file"
    ]
}, size=10)
random.seed(1)
sample = res['hits']['hits']
#comment previous line, and un-comment next line for a random sample instead
#change the int to randomly sample a certain number of rows from your query
#randomsample = random.sample(res['hits']['hits'], 5)

print("Got %d Hits:" % res['hits']['total'])

with open('mytest.tsv', 'wb') as csvfile: #set name of output file here
    #we use TAB delimited, to handle cases where freeform text may have a comma
    filewriter = csv.writer(csvfile, delimiter='\t',
                            quotechar='|', quoting=csv.QUOTE_MINIMAL)
    #create header row
    filewriter.writerow(["id", "fields"]) #change the column labels here
    for hit in sample: #switch sample to randomsample if you want a random subset, instead of all rows
        try: #try/except used to handle unstructured data, in cases where a field may not exist for a given hit
            col1 = hit["_id"]
        except Exception:
            col1 = ""
        try:
            col2 = hit["fields"]
            col2 = col2.replace('\n', ' ')
        except Exception:
            col2 = ""
        filewriter.writerow([col1,col2])

And it works! I get all the text from the file.
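
For anyone running this on Python 3, here is a minimal sketch of the same export step. It assumes each hit carries `fields` as a dict mapping a field name to a list of values (so the text has to be unwrapped before newlines can be replaced); that response shape is an assumption, not something confirmed in this thread. The helper names `flatten_field`, `hits_to_rows`, and `write_tsv` are hypothetical, and the sample hits are made up so the snippet runs without an Elasticsearch node.

```python
import csv
import io


def flatten_field(value):
    """Collapse a string, list, or dict of values into one line of text."""
    if isinstance(value, dict):
        parts = [flatten_field(v) for v in value.values()]
    elif isinstance(value, (list, tuple)):
        parts = [flatten_field(v) for v in value]
    elif value is None:
        parts = []
    else:
        parts = [str(value)]
    # split() with no arguments drops newlines, tabs, and repeated spaces
    return " ".join(" ".join(parts).split())


def hits_to_rows(hits, field="file"):
    """Yield (id, text) pairs; anything missing becomes an empty string."""
    for hit in hits:
        doc_id = hit.get("_id", "")
        fields = hit.get("fields")
        # Assumed shape: {"file": ["..."]}; fall back to the raw value otherwise
        value = fields.get(field) if isinstance(fields, dict) else fields
        yield doc_id, flatten_field(value)


def write_tsv(hits, out, field="file"):
    """Write a header row plus one TAB-delimited row per hit to `out`."""
    writer = csv.writer(out, delimiter="\t", quotechar="|",
                        quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(["id", "fields"])
    writer.writerows(hits_to_rows(hits, field))


# Made-up hits, shaped like res['hits']['hits'] under the assumption above
sample_hits = [
    {"_id": "1", "fields": {"file": ["First line\nsecond line"]}},
    {"_id": "2"},  # no "fields" key: the text column stays empty
]

buffer = io.StringIO()
write_tsv(sample_hits, buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

To write to disk instead of an in-memory buffer, pass a file opened in text mode, for example `open('mytest.tsv', 'w', newline='')`, in place of `buffer`.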

Really, Charlie, thank you :)
