I was able to upload 1650 PDF documents into Elasticsearch using the ingest attachment plugin.
However, it looks like I am locked into the schema that was generated, and I am also not happy with search performance.
As such, I am trying to load the PDFs using my own mapping, which excludes the PDF content field from _source while still allowing it to be searched.
The documents load successfully, but I am unable to search within the PDF content, even though I can search my custom metadata.
Following are my mapping and the Python Elasticsearch code used to load the data.
Mapping:
mapping = '''
{
    "_source": {
        "excludes": [
            "content"
        ]
    },
    "properties": {
        "title": {
            "type": "text"
        },
        "ip": {
            "type": "text"
        },
        "content": {
            "type": "text",
            "store": true
        },
        "query": {
            "properties": {
                "match_phrase": {
                    "properties": {
                        "content": {
                            "type": "text",
                            "fields": {
                                "keyword": {
                                    "type": "keyword",
                                    "ignore_above": 256
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
'''
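As a quick sanity check, the mapping string does parse as valid JSON and contains the fields I expect (shown here abridged to the top-level fields; the call I use to apply it needs a running cluster, so it is commented out):

```python
import json

# Abridged copy of the mapping above, just to confirm it parses cleanly.
mapping = '''
{
    "_source": { "excludes": ["content"] },
    "properties": {
        "title":   { "type": "text" },
        "ip":      { "type": "text" },
        "content": { "type": "text", "store": true }
    }
}
'''

parsed = json.loads(mapping)
print(sorted(parsed["properties"]))      # ['content', 'ip', 'title']
print(parsed["_source"]["excludes"])     # ['content']

# Applied to the index with (requires a running cluster):
# es.indices.create(index='jdocs2', body={'mappings': parsed})
```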
Python snippet (read the PDF as binary, Base64-encode it, decode the result back to a string, then index):

import base64
from elasticsearch import Elasticsearch

es = Elasticsearch()

# file, mfn, ip, and title are set earlier in the script.
with open(file, "rb") as pdf_file:
    enc_pdf = base64.b64encode(pdf_file.read()).decode('ascii')

respdf = es.index(index='jdocs2', doc_type='_doc', id=mfn,
                  body={'content': enc_pdf, 'ip': ip, 'title': title})
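One thing I noticed while poking at this: after Base64 encoding, the original terms are no longer present in the string that gets indexed into content, so the analyzer would never see them (a standalone illustration with made-up sample bytes, no cluster needed):

```python
import base64

raw = b"installation guide for the widget"   # stand-in for PDF bytes
enc = base64.b64encode(raw).decode('ascii')

print(enc)                            # 'aW5zdGFsbGF0aW9u...'
print('guide' in enc)                 # False: the term is gone after encoding
print(base64.b64decode(enc) == raw)   # True: round-trips, but only as bytes
```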
The file indexes successfully, but I get no results when querying the PDF content for known terms. Searching the custom metadata does work, for example:
res = es.search(index="jdocs2", body={"size": 2, "query": {"match": {"title": "guide"}}})
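For completeness, the equivalent query against the content field (which returns zero hits for me) is built the same way; here is just the request body, since actually running it needs the cluster, and "guide" stands in for a term I know appears in the PDFs:

```python
# Request body for the content search that returns no hits.
content_query = {
    "size": 2,
    "query": {
        "match": {
            "content": "guide"
        }
    }
}

# Sent with: res = es.search(index="jdocs2", body=content_query)
print(content_query["query"]["match"]["content"])   # guide
```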
Am I missing a step prior to indexing, or are there other suggestions?
Any help would be appreciated!
Regards,
Jeff Gajda