Searching through PDF attachments and other documents in Elasticsearch with one query

Hello everybody,

I'm using the ingest-attachment plugin to parse PDF files in an Elasticsearch 7 cluster. Each PDF file provides additional information about an already existing document.

I'm trying to write a query that retrieves all documents containing a given text, either in their own properties or in their corresponding PDF file.

Ideally, I would like to store the PDF file content as a field of the already existing document, but I can't find a way to do it with the ingest-attachment plugin.

As a workaround, I thought of making a kind of one-to-one join query, but some sources say that it should be avoided if possible.

Is there a proper solution for this use case?

Welcome!

Out of the box, the ingest attachment plugin also stores the binary content. I'm not sure I understood your question or problem. Could you explain?

That said, I'm not a fan of storing the binary content in Elasticsearch. Instead, I'd recommend storing it in a third-party service (such as an HTTP server) and storing only the link inside the final document (in addition to the extracted text).

I'd do something like:

# Pipeline: extract text from the base64 "data" field, then drop the binary
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    },{
      "remove": {
        "field": "data"
      }
    }
  ]
}

# Index a document through the pipeline
PUT my-index-00001/_doc/my_id?pipeline=attachment
{
  "url": "http://server/path/to/file.txt",
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}

# Fetch the stored document
GET my-index-00001/_doc/my_id

This should give:

{
  "found": true,
  "_index": "my-index-00001",
  "_type": "_doc",
  "_id": "my_id",
  "_version": 1,
  "_seq_no": 22,
  "_primary_term": 1,
  "_source": {
    "url": "http://server/path/to/file.txt",
    "attachment": {
      "content_type": "application/rtf",
      "language": "ro",
      "content": "Lorem ipsum dolor sit amet",
      "content_length": 28
    }
  }
}

Hi @dadoonet, thank you for your answer.

Indeed, I have created the same pipeline in order to keep the text content of the PDF without the binary content.

I have an index containing people with some properties (firstname, lastname...). Each person can have a resume, and I want to find people by first name, last name, or resume content.

Currently, I have two different indexes, one for the people and one for the resumes, and I don't know how to write the query. I suppose the data model should be improved. For instance, people metadata and resume content could be in the same document, but I don't know if that's possible with the ingest-attachment plugin.

Yeah, I'd most likely do something like:

PUT my-index-00001/_doc/my_id?pipeline=attachment
{
  "name": "David P",
  "birthdate": "XX/XX/XXXX",
  "country": "France",
  "resume": {
    "url": "http://server/path/to/file.txt",
    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
  }
}
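
Note that the `attachment` pipeline defined earlier in this thread reads the top-level `data` field, so for this nested layout the processors would need to point at `resume.data` instead. A minimal sketch, assuming a separate pipeline named `resume-attachment` (a made-up name) and assuming the extracted text should live under `resume.attachment`:

```console
# Hypothetical pipeline name; field paths assume the nested "resume" object above.
PUT _ingest/pipeline/resume-attachment
{
  "description": "Extract the resume text and drop the base64 payload",
  "processors": [
    {
      "attachment": {
        "field": "resume.data",
        "target_field": "resume.attachment"
      }
    },
    {
      "remove": {
        "field": "resume.data"
      }
    }
  ]
}
```

The document would then be indexed with `?pipeline=resume-attachment`, and the extracted text should end up in `resume.attachment.content` next to `resume.url`.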

Another idea is to use the enrich processor to do the lookup at index time. See https://www.elastic.co/guide/en/elasticsearch/reference/current/match-enrich-policy-type.html
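
Either way, once the person properties and the resume text sit in the same document, one query can cover both. The sketch below assumes the extracted text is stored in `resume.attachment.content` (this depends on the pipeline's `target_field`; with the default target it would be `attachment.content`). The search text is a placeholder.

```console
# Field names are assumptions; adjust them to your own mapping.
GET my-index-00001/_search
{
  "query": {
    "multi_match": {
      "query": "lorem ipsum",
      "fields": ["name", "country", "resume.attachment.content"]
    }
  }
}
```

A hit is returned when the text matches any of the listed fields, so a person is found whether the terms appear in the properties or in the resume.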

Hi @dadoonet, thanks for your answer. I've tested and it works perfectly!!!

I have one last question regarding this topic. Is it possible to add the attachment without resending the person's properties (name, birthdate...)?

Currently, I need to do a GET request before every PUT request in order to retrieve the person's properties.

I have checked the Update API, and it's not able to call a pipeline. I have also checked the script processor, but I don't think it can solve this either.

No. You need to send everything again.

The GET you are doing could be done by the enrich processor, though. It would do the lookup at index time. See the enrich documentation linked earlier in this thread.

In which case, you would only send something like:

PUT my-index-00001/_doc/my_id?pipeline=attachment
{
  "username": "dadoonet",
  "resume": {
    "url": "http://server/path/to/file.txt",
    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
  }
}
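
To make that more concrete, here is a rough sketch of what the enrich setup could look like. It assumes the people properties live in a separate source index called `people`, keyed by a `username` field. The policy name, pipeline name and target field are made up for illustration.

```console
# 1. Policy describing the lookup (index and field names are assumptions).
PUT _enrich/policy/people-policy
{
  "match": {
    "indices": "people",
    "match_field": "username",
    "enrich_fields": ["name", "birthdate", "country"]
  }
}

# 2. Build the enrich index from the current content of "people".
PUT _enrich/policy/people-policy/_execute

# 3. Pipeline: look up the person by username, then extract the resume text.
PUT _ingest/pipeline/resume-with-person
{
  "description": "Add person properties and extract the resume text",
  "processors": [
    {
      "enrich": {
        "policy_name": "people-policy",
        "field": "username",
        "target_field": "person"
      }
    },
    {
      "attachment": {
        "field": "resume.data",
        "target_field": "resume.attachment"
      }
    },
    {
      "remove": {
        "field": "resume.data"
      }
    }
  ]
}
```

With this in place the document would be indexed with `?pipeline=resume-with-person`, and the looked-up properties would appear under `person` rather than at the top level. The enrich index is a snapshot of `people`, so the policy has to be executed again after that index changes.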