Indexing PDFs directly

I was able to upload 1650 PDF documents into Elasticsearch using the ingestor plugin.
However, it looks like I am locked into the schema that was generated and am also not happy with search performance.
As such, I am trying to load the PDFs using my own mapping, which excludes the PDF content field (the field being searched) from _source.
The documents do load successfully, but I am unable to search within the PDF content, though I am able to search my custom metadata.
Following is my mapping and the Python ES code used to load the data.

Mapping:
mapping = '''
{
  "_source": {
    "excludes": [
      "content"
    ]
  },
  "properties": {
    "title": {
      "type": "text"
    },
    "ip": {
      "type": "text"
    },
    "content": {
      "type": "text",
      "store": "true"
    },
    "query": {
      "properties": {
        "match_phrase": {
          "properties": {
            "content": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        }
      }
    }
  }
}
'''

Python snippet:

Read PDF as binary, convert to BASE64, back to string and then index.

import base64

with open(file, "rb") as pdf_file:
    enc_pdf = base64.b64encode(pdf_file.read()).decode('ascii')
respdf = es.index(index='jdocs2', doc_type='_doc', id=mfn, body={'content': enc_pdf, "ip": ip, "title": title })

The file indexes successfully, but I get no results when trying to query the PDF content for known terms.
res = es.search(index="jdocs2", body={"size": 2, "query": {"match": {"title": "guide"}}})

Am I missing a step prior to indexing or are there other suggestions?
Any help would be appreciated!

Regards,
Jeff Gajda

You'd need to share an example of your parsed content field.
I don't know much about PDF formats but I expect there's much more to parsing than base64 decoding.
Have you tried using the TIKA library directly to take more control over which fields to index?
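As a rough sketch of what using Tika directly could look like from Python: this assumes the third-party tika package (tika-python) is installed, which in turn needs Java available to run a Tika server. The helper names extract_with_tika and build_document are made up for illustration. The idea is to put the extracted plain text, rather than the Base64 string, into the content field.

```python
def extract_with_tika(path):
    """Extract plain text from a file using the third-party `tika` package.

    Assumption: tika-python is installed and Java is available; the import is
    done lazily so the rest of the module works without it.
    """
    from tika import parser

    parsed = parser.from_file(path)
    # "content" can be None when no text could be extracted.
    return (parsed.get("content") or "").strip()


def build_document(path, title, ip, extract=extract_with_tika):
    """Build the body to index: extracted text plus the custom metadata."""
    return {"content": extract(path), "title": title, "ip": ip}


# Hypothetical usage with the client and variables from the original post:
# es.index(index='jdocs2', id=mfn, body=build_document(file, title, ip))
```

Passing the extractor in as a parameter keeps the indexing code independent of how the text is obtained, so a different extractor can be swapped in later.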

Also have a look at the FSCrawler project. It extracts a lot of metadata as well. Might be useful.

Mark, David

Following is a sample of the output of the Base64 encoding and conversion back to a string, produced by the following Python snippet:
with open(file, "rb") as pdf_file:
    enc_pdf = base64.b64encode(pdf_file.read()).decode('ascii')
print(enc_pdf)

JVBERi0xLjYNJeLjz9MNCjQ2NSAwIG9iag08PC9GaWx0ZXIvRmxhdGVEZWNvZGUvRmlyc3QgMzcvTGVu
Z3RoIDM1Mi9OIDUvVHlwZS9PYmpTdG0+PnN0cmVhbQ0KAMLoHK6ppeTW8z4SOnBk4K3L3ZEs9dKsisPJ
jr1SRSAjvBFrd63ny8sVh4UWWovapio2+/M/QxKwOuVwhyu4agobDDv8IA2zJDg3SD9QL0KC/1xYVUQq
Bm8TPkq+7XkuZyvV1nIi5wosuzjW6hu+sLgmOhXAGIeWVM3v5QeaBLhzBedK0UnJpEtBGO69ZsmbZK1L
T3sckFFZ26Avf5Nbv2mcqKtpcDxdYwHWDPZMFb2RnxsN6MamUD3s+6WY......
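A small illustration of why a field holding text like this is unlikely to match known terms: Base64 re-encodes every three bytes as four different characters, so the words in the source no longer appear in the encoded string, and a text field can only tokenize the characters it is given. The sample sentence below is made up, and a real PDF adds compression on top of this (the /FlateDecode filter is visible once the sample above is decoded).

```python
import base64

# Hypothetical stand-in for the readable text inside a document.
raw = b"Installation guide for the search cluster"

# Same encoding step as in the indexing snippet.
encoded = base64.b64encode(raw).decode("ascii")
decoded = base64.b64decode(encoded).decode("ascii")

print(encoded)
print("guide" in encoded)  # is the term visible in what would be indexed?
print("guide" in decoded)  # is the term visible after decoding?
```

The term only shows up after decoding, which is why the text has to be extracted before indexing if it is to be searchable.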

I have not tried using the TIKA library as I am not that knowledgeable yet. I will try the FSCrawler as suggested and may come back with more questions.
Thank you both for your insight.