Indexing PDFs directly

I was able to upload 1650 PDF documents into Elasticsearch using the ingestor plugin.
However, it looks like I am locked into the schema that was generated, and I am also not happy with the search performance.
As such, I am trying to load the PDFs using my own mapping, which excludes the PDF content field being searched from _source.
The documents load successfully, but I am unable to search within the PDF content, although I can search my custom metadata.
Following are my mapping and the Python ES code used to load the data.

Mapping:
mapping = '''
{
  "_source": {
    "excludes": [
      "content"
    ]
  },
  "properties": {
    "title": {
      "type": "text"
    },
    "ip": {
      "type": "text"
    },
    "content": {
      "type": "text",
      "store": "true"
    },
    "query": {
      "properties": {
        "match_phrase": {
          "properties": {
            "content": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        }
      }
    }
  }
}
'''

Python snippet:

Read the PDF as binary, encode it as Base64, convert the result to a string, and then index it.

with open(file, "rb") as pdf_file:
    enc_pdf = base64.b64encode(pdf_file.read()).decode('ascii')
respdf = es.index(index='jdocs2', doc_type='_doc', id=mfn, body={'content': enc_pdf, "ip": ip, "title": title})

The file indexes successfully, but I get no results when trying to query the PDF content for known terms.
res = es.search(index="jdocs2", body={"size": 2, "query": {"match": {"title": "guide"}}})

Am I missing a step prior to indexing or are there other suggestions?
Any help would be appreciated!

Regards,
Jeff Gajda
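
One possible missing step, sketched under assumptions: a Base64 string sent to a plain text field is analyzed as-is, so Elasticsearch never decodes or parses the PDF. If the ingest attachment plugin is installed (it may or may not be the "ingestor plugin" mentioned above), an ingest pipeline with an attachment processor can extract the text at index time while the index keeps its own mapping. The pipeline id "pdf_attachment", the source field "data", and the helper names below are hypothetical; the extracted text lands in attachment.content by default, so that is the field to query.

```python
import base64
from typing import Any, Dict


def attachment_pipeline(source_field: str = "data") -> Dict[str, Any]:
    """Pipeline body: one attachment processor reading Base64 from source_field."""
    return {
        "description": "Extract text from Base64-encoded PDFs",
        "processors": [
            {"attachment": {"field": source_field}},
            # Drop the raw Base64 once the text has been extracted.
            {"remove": {"field": source_field}},
        ],
    }


def pdf_document(pdf_bytes: bytes, title: str, ip: str,
                 source_field: str = "data") -> Dict[str, Any]:
    """Document body carrying the Base64 payload plus the custom metadata."""
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return {source_field: encoded, "title": title, "ip": ip}


def content_query(term: str) -> Dict[str, Any]:
    """Search the extracted text rather than the title."""
    return {"size": 2, "query": {"match": {"attachment.content": term}}}


# Usage against a live cluster (not run here; es, file, mfn, ip and title
# are the variables from the post above):
# es.ingest.put_pipeline(id="pdf_attachment", body=attachment_pipeline())
# with open(file, "rb") as pdf_file:
#     doc = pdf_document(pdf_file.read(), title, ip)
# es.index(index="jdocs2", id=mfn, body=doc, pipeline="pdf_attachment")
# res = es.search(index="jdocs2", body=content_query("guide"))
```

With this approach, a custom mapping would describe attachment.content (and any other attachment.* fields of interest) instead of a top-level content field holding Base64.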

You'd need to share an example of your parsed content field.
I don't know much about PDF formats but I expect there's much more to parsing than base64 decoding.
Have you tried using the TIKA library directly to take more control over which fields to index?
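
A rough sketch of what using Tika directly could look like from Python, assuming the third-party tika package (a client for Apache Tika that needs Java available) is installed. The helper names extract_pdf_text and build_document are made up for illustration. The idea is to index the extracted plain text in the content field instead of the Base64 string, which leaves the choice of fields entirely to the custom mapping.

```python
from typing import Any, Dict


def extract_pdf_text(path: str) -> str:
    """Return plain text extracted by Apache Tika, or '' if none was found."""
    # Imported lazily so build_document() works even without Tika installed.
    from tika import parser

    parsed = parser.from_file(path)
    return (parsed.get("content") or "").strip()


def build_document(text: str, title: str, ip: str) -> Dict[str, Any]:
    """Document body whose content field holds searchable text, not Base64."""
    # Collapse the runs of whitespace that PDF extraction tends to leave behind.
    return {"content": " ".join(text.split()), "title": title, "ip": ip}


# Usage against a live cluster (not run here; es, file, mfn, ip and title
# are the variables from the original post):
# doc = build_document(extract_pdf_text(file), title, ip)
# es.index(index="jdocs2", id=mfn, body=doc)
# es.search(index="jdocs2", body={"query": {"match": {"content": "guide"}}})
```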

Also have a look at the FSCrawler project. It extracts a lot of metadata as well. Might be useful.

Mark, David

Following is a sample of the output of the Base64 encoding and conversion back to a string, produced by the following Python snippet:
with open(file, "rb") as pdf_file:
    enc_pdf = base64.b64encode(pdf_file.read()).decode('ascii')
print(enc_pdf)

JVBERi0xLjYNJeLjz9MNCjQ2NSAwIG9iag08PC9GaWx0ZXIvRmxhdGVEZWNvZGUvRmlyc3QgMzcvTGVu
Z3RoIDM1Mi9OIDUvVHlwZS9PYmpTdG0+PnN0cmVhbQ0KAMLoHK6ppeTW8z4SOnBk4K3L3ZEs9dKsisPJ
jr1SRSAjvBFrd63ny8sVh4UWWovapio2+/M/QxKwOuVwhyu4agobDDv8IA2zJDg3SD9QL0KC/1xYVUQq
Bm8TPkq+7XkuZyvV1nIi5wosuzjW6hu+sLgmOhXAGIeWVM3v5QeaBLhzBedK0UnJpEtBGO69ZsmbZK1L
T3sckFFZ26Avf5Nbv2mcqKtpcDxdYwHWDPZMFb2RnxsN6MamUD3s+6WY......

I have not tried using the TIKA library as I am not that knowledgeable yet. I will try FSCrawler as suggested and may come back with more questions.
Thank you both for your insight.
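
As a small illustration of why the indexed field cannot be searched for known terms: decoding the first line of the sample output gives the raw PDF header followed by the start of a Flate-compressed object stream, so the readable text of the document is not present in the bytes, let alone in their Base64 form. Only the standard library is used here.

```python
import base64

# First line of the Base64 sample printed in the post above.
sample = (
    "JVBERi0xLjYNJeLjz9MNCjQ2NSAwIG9iag08PC9GaWx0ZXIvRmxhdGVEZWNvZGUv"
    "Rmlyc3QgMzcvTGVu"
)

raw = base64.b64decode(sample)

print(raw[:9])                        # b'%PDF-1.6\r'
print(b"/Filter/FlateDecode" in raw)  # True
```

The page text sits inside compressed streams like this one, which is why a PDF parser such as Tika is needed before the content can be indexed as text.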
