I'm using the ingest-attachment plugin to parse PDF files in an Elasticsearch 7 cluster. Each PDF file provides additional information about an already existing document.
I'm trying to write a query that retrieves all the documents that contain a given text, either in their properties or in their corresponding PDF file.
Ideally, I would like to store the PDF file content as a field of the already existing document, but I can't find a way to do it with the ingest-attachment plugin.
As a workaround, I thought of doing a kind of one-to-one join query, but some sources say that joins should be avoided if possible.
Out of the box, the ingest attachment plugin also stores the binary content. I'm not sure I understood your question or problem. Could you explain?
That said, I'm not a fan of storing the binary content in Elasticsearch. Instead, I'd recommend storing it in a third-party service (like an HTTP service) and storing only the link in the final document (in addition to the extracted text).
Indeed, I have created the same pipeline in order to keep the text content of the PDF without the binary content.
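For readers following along, here is a minimal sketch of what such a pipeline body might look like, written as a Python dict so it can be inspected without a cluster. The source field name `data` and the helper name `build_attachment_pipeline` are assumptions for illustration, not taken from this thread. The `attachment` processor extracts the text, and the `remove` processor then drops the base64 field.

```python
import json


def build_attachment_pipeline(source_field: str = "data") -> dict:
    """Build an ingest pipeline body (hypothetical helper).

    `source_field` is an assumed name for the field that holds the
    base64-encoded PDF in the incoming document.
    """
    return {
        "description": "Extract text from a base64-encoded file, then drop the binary",
        "processors": [
            # Parses the base64 content; by default the extracted text is
            # written under the `attachment` field (e.g. attachment.content).
            {"attachment": {"field": source_field}},
            # Removes the original base64 field so the binary is not stored.
            {"remove": {"field": source_field}},
        ],
    }


pipeline = build_attachment_pipeline()
print(json.dumps(pipeline, indent=2))
```

The resulting JSON could be sent as the body of a `PUT _ingest/pipeline/<name>` request, where the pipeline name is yours to choose.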
I have an index containing people with some properties (firstname, lastname...). Each person can have a resume, and I want to find people by firstname, lastname, or resume content.
Currently, I have two different indexes, one for the people and one for the resumes, and I don't know how to write the query. I suppose that the data model should be improved. For instance, people metadata and resume content could be in the same document, but I don't know if that's possible with the ingest-attachment plugin.
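One possible way to model this, sketched under assumptions: put the person's properties and the base64-encoded resume in the same source document, index it through an attachment pipeline, and search across all three fields with a `multi_match` query. The field name `data`, the helper names, and the sample values below are hypothetical. `attachment.content` is the attachment processor's default target for extracted text, so it would differ if the pipeline sets a custom `target_field`.

```python
import base64
import json


def build_person_document(firstname: str, lastname: str, resume_bytes: bytes) -> dict:
    """Combine person metadata and the resume in one source document.

    `data` is an assumed field name; it must match the field that the
    attachment processor in the ingest pipeline reads from.
    """
    return {
        "firstname": firstname,
        "lastname": lastname,
        "data": base64.b64encode(resume_bytes).decode("ascii"),
    }


def build_search_query(text: str) -> dict:
    """Search the person's properties and the extracted resume text together."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": ["firstname", "lastname", "attachment.content"],
            }
        }
    }


# Placeholder bytes standing in for a real PDF file read from disk.
doc = build_person_document("Jane", "Doe", b"placeholder resume bytes")
query = build_search_query("elasticsearch")
print(json.dumps(doc, indent=2))
print(json.dumps(query, indent=2))
```

The document body would be indexed with the `pipeline` query parameter on the index request (for example `PUT people/_doc/1?pipeline=<name>`, where `people` is an assumed index name), and the query body would go to `_search` on that same index. With everything in one document, no join between two indexes is needed.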