Indexing PDFs directly

jtgajda · September 11, 2019, 6:59pm

I was able to upload 1650 PDF documents into Elasticsearch using the ingestor plugin.
However, it looks like I am locked into the schema that was generated and am also not happy with search performance.
As such, I am trying to load the PDFs using my own mapping, which excludes the PDF content field being searched from _source.
The documents load successfully, but I am unable to search within the PDF content, though I am able to search my custom metadata.
Below are my mapping and the Python ES code used to load the data.

Mapping:
mapping = '''
{
  "_source": {
    "excludes": [
      "content"
    ]
  },
  "properties": {
    "title": {
      "type": "text"
    },
    "ip": {
      "type": "text"
    },
    "content": {
      "type": "text",
      "store": "true"
    },
    "query": {
      "properties": {
        "match_phrase": {
          "properties": {
            "content": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        }
      }
    }
  }
}
'''

Python snippet:

import base64

# Read the PDF as binary, Base64-encode it, decode the result to an ASCII string, then index it.
# `es` is an existing Elasticsearch client; `file`, `mfn`, `ip`, and `title` are set elsewhere.
with open(file, "rb") as pdf_file:
    enc_pdf = base64.b64encode(pdf_file.read()).decode('ascii')
respdf = es.index(index='jdocs2', doc_type='_doc', id=mfn, body={'content': enc_pdf, "ip": ip, "title": title})

The file indexes successfully, but I get no results when trying to query the PDF content for known terms.
res = es.search(index="jdocs2", body={"size": 2, "query": {"match": {"title": "guide"}}})

Am I missing a step prior to indexing or are there other suggestions?
Any help would be appreciated!

Regards,
Jeff Gajda

Mark_Harwood · September 13, 2019, 2:22pm

You'd need to share an example of your parsed content field.
I don't know much about PDF formats but I expect there's much more to parsing than base64 decoding.
Have you tried using the TIKA library directly to take more control over which fields to index?
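
One way to add the missing extraction step inside Elasticsearch itself is an ingest pipeline with the `attachment` processor, which runs Tika on a Base64 field and writes the extracted text to a sub-field. The following is a minimal sketch, not the poster's code. It assumes the ingest attachment plugin is installed and a 7.x Python client like the one used above; the names `pdf_attachment`, `data`, `attachment_pipeline`, `pdf_document`, and `index_pdf` are made up for illustration.

```python
import base64


def attachment_pipeline(source_field="data", target_field="attachment"):
    """Build an ingest pipeline body that extracts text from a Base64 field."""
    return {
        "description": "Extract text from a Base64-encoded PDF",
        "processors": [
            # Tika-based extraction; the text lands in <target_field>.content.
            {"attachment": {"field": source_field, "target_field": target_field}},
            # Drop the raw Base64 so it is neither stored nor indexed.
            {"remove": {"field": source_field}},
        ],
    }


def pdf_document(pdf_bytes, title, ip, source_field="data"):
    """Build the document body: Base64 payload plus the custom metadata."""
    return {
        source_field: base64.b64encode(pdf_bytes).decode("ascii"),
        "title": title,
        "ip": ip,
    }


def index_pdf(es, index, doc_id, pdf_bytes, title, ip, pipeline_id="pdf_attachment"):
    """Register the pipeline (idempotent) and index one PDF through it.

    `es` is assumed to be an elasticsearch-py 7.x client.
    """
    es.ingest.put_pipeline(id=pipeline_id, body=attachment_pipeline())
    return es.index(
        index=index,
        id=doc_id,
        pipeline=pipeline_id,
        body=pdf_document(pdf_bytes, title, ip),
    )
```

With a pipeline like this, the extracted text would be searched on `attachment.content` (for example `{"match": {"attachment.content": "guide"}}`) rather than on the raw Base64 string.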

dadoonet · September 13, 2019, 4:52pm

Also have a look at the FSCrawler project. It extracts a lot of metadata as well. Might be useful.

jtgajda · September 16, 2019, 2:44pm

Mark, David

Following is a sample of the output of the Base64 encoding and conversion back to a string, produced by the following Python snippet:
with open(file, "rb") as pdf_file:
    enc_pdf = base64.b64encode(pdf_file.read()).decode('ascii')
print(enc_pdf)

JVBERi0xLjYNJeLjz9MNCjQ2NSAwIG9iag08PC9GaWx0ZXIvRmxhdGVEZWNvZGUvRmlyc3QgMzcvTGVu
Z3RoIDM1Mi9OIDUvVHlwZS9PYmpTdG0+PnN0cmVhbQ0KAMLoHK6ppeTW8z4SOnBk4K3L3ZEs9dKsisPJ
jr1SRSAjvBFrd63ny8sVh4UWWovapio2+/M/QxKwOuVwhyu4agobDDv8IA2zJDg3SD9QL0KC/1xYVUQq
Bm8TPkq+7XkuZyvV1nIi5wosuzjW6hu+sLgmOhXAGIeWVM3v5QeaBLhzBedK0UnJpEtBGO69ZsmbZK1L
T3sckFFZ26Avf5Nbv2mcqKtpcDxdYwHWDPZMFb2RnxsN6MamUD3s+6WY......

I have not tried using the TIKA library as I am not that knowledgeable yet. I will try FSCrawler as suggested and may come back with more questions.
Thank you both for your insight.
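
Decoding the first line of the sample shows what was actually indexed: the raw PDF file, whose object stream is compressed with FlateDecode. The words on the page do not appear as plain text in such a string, which would be consistent with a full-text query on `content` finding nothing. A minimal sketch using only the standard library; the variable names are illustrative:

```python
import base64

# First line of the Base64 output posted above, copied verbatim.
sample = (
    "JVBERi0xLjYNJeLjz9MNCjQ2NSAwIG9iag08PC9GaWx0ZXIvRmxhdGVEZWNvZGUvRmlyc3QgMzcvTGVu"
)

# Reverse the encoding to recover the original bytes of the file.
decoded = base64.b64decode(sample)

# A PDF starts with a "%PDF-" header; FlateDecode marks a zlib-compressed stream.
is_pdf = decoded.startswith(b"%PDF-")
uses_flate = b"/Filter/FlateDecode" in decoded

print(is_pdf, uses_flate)  # True True
```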

system · October 14, 2019, 2:44pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
