Searching through PDF attachments and other documents in ElasticSearch with one query

Bertrand_Nau · September 30, 2020, 2:04pm

Hello everybody,

I'm using the ingest-attachment plugin to parse PDF files in an ElasticSearch 7 cluster. Each PDF file gives additional information about an already existing document.

I'm trying to write a query that retrieves all the documents containing a given text, either in their properties or in their corresponding PDF file.

Ideally, I would like to store the PDF file content as a field of the already existing document, but I can't find a way to do it with the ingest-attachment plugin.

As a workaround, I thought of making a kind of one-to-one join query, but some sources say that it should be avoided if possible.

Is there a proper solution for this use case?

dadoonet · September 30, 2020, 4:25pm

Welcome!

OOTB the ingest attachment plugin also stores the binary content. I'm not sure I understood your question or problem. Could you explain?

That said, I'm not a fan of storing the binary content in Elasticsearch. Instead, I'd recommend storing it in a 3rd party service (like an http service) and just store the link inside the final document (in addition to the extracted text).

I'd do something like:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    },{
      "remove": {
        "field": "data"
      }
    }
  ]
}
PUT my-index-00001/_doc/my_id?pipeline=attachment
{
  "url": "http://server/path/to/file.txt",
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my-index-00001/_doc/my_id

This should give:

{
  "found": true,
  "_index": "my-index-00001",
  "_type": "_doc",
  "_id": "my_id",
  "_version": 1,
  "_seq_no": 22,
  "_primary_term": 1,
  "_source": {
    "url": "http://server/path/to/file.txt",
    "attachment": {
      "content_type": "application/rtf",
      "language": "ro",
      "content": "Lorem ipsum dolor sit amet",
      "content_length": 28
    }
  }
}

Bertrand_Nau · September 30, 2020, 5:20pm

Hi @dadoonet, thank you for your answer.

Indeed, I have created the same pipeline in order to keep the text content of the PDF without the binary content.

I have an index containing people with some properties (firstname, lastname...). Each person can have a resume, and I want to find people by firstname, lastname, or resume content.

Currently, I have two different indexes, one for the people, and one for the resumes and I don't know how to make the query. I suppose that the modelling of the data should be improved. For instance, people metadata and resume content could be in the same document, but I don't know if it's possible with the ingest-attachment plugin.

dadoonet · September 30, 2020, 7:03pm

Yeah. I'd most likely do something like:

PUT my-index-00001/_doc/my_id?pipeline=attachment
{
  "name": "David P",
  "birthdate": "XX/XX/XXXX",
  "country": "France",
  "resume": {
    "url": "http://server/path/to/file.txt",
    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
  }
}
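
One caveat, offered as a sketch rather than a tested setup: the `attachment` pipeline shown earlier reads the top-level `data` field, so with the file nested under `resume` the processor would need to point at `resume.data`. The pipeline name `resume-attachment` and the `target_field` value below are assumptions chosen for illustration; the searched field names follow the example document.

```console
# Hypothetical pipeline: extract text from resume.data into resume.attachment
PUT _ingest/pipeline/resume-attachment
{
  "description": "Extract resume text and drop the base64 payload",
  "processors": [
    {
      "attachment": {
        "field": "resume.data",
        "target_field": "resume.attachment"
      }
    },
    {
      "remove": {
        "field": "resume.data"
      }
    }
  ]
}

# After indexing the person with ?pipeline=resume-attachment, one query can
# cover both the person fields and the extracted resume text
GET my-index-00001/_search
{
  "query": {
    "multi_match": {
      "query": "lorem ipsum",
      "fields": ["name", "country", "resume.attachment.content"]
    }
  }
}
```

With this layout the extracted text lives in `resume.attachment.content`, next to the person's own fields, so no join between two indexes is needed at search time.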

Another idea is to use the enrich processor to do the lookup at index time. See https://www.elastic.co/guide/en/elasticsearch/reference/current/match-enrich-policy-type.html

Bertrand_Nau · October 1, 2020, 1:25pm

Hi @dadoonet, thanks for your answer. I've tested and it works perfectly!!!

I have one last question regarding this topic. Is it possible to put the attachment without resending the people properties (name, birthdate...)?

Currently, I need to do a GET query before every PUT query in order to retrieve the people properties.

I have checked the Update API, and it can't call a pipeline. I have also checked the script processor, but I don't think it can solve this either.

dadoonet · October 1, 2020, 1:46pm

No. You need to send everything again.

The GET you are doing could be done by the enrich processor, though. It would do the lookup at index time. See the match enrich policy documentation linked in my previous reply.

In which case, you would only send something like:

PUT my-index-00001/_doc/my_id?pipeline=attachment
{
  "username": "dadoonet",
  "resume": {
    "url": "http://server/path/to/file.txt",
    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
  }
}
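
A rough sketch of how that could be wired up, assuming the people properties live in a separate source index (called `people` here) that has a `username` field. The policy name `people-policy`, the pipeline name `resume-with-person`, and the `person` target field are all made up for illustration.

```console
# Hypothetical enrich policy: look people up by username
PUT /_enrich/policy/people-policy
{
  "match": {
    "indices": "people",
    "match_field": "username",
    "enrich_fields": ["name", "birthdate", "country"]
  }
}

# Build the enrich index; this has to be re-run after the people index changes
PUT /_enrich/policy/people-policy/_execute

# Pipeline that copies the matching person fields, then extracts the resume text
PUT _ingest/pipeline/resume-with-person
{
  "processors": [
    {
      "enrich": {
        "policy_name": "people-policy",
        "field": "username",
        "target_field": "person"
      }
    },
    {
      "attachment": {
        "field": "resume.data",
        "target_field": "resume.attachment"
      }
    },
    {
      "remove": {
        "field": "resume.data"
      }
    }
  ]
}
```

The request above would then be sent with `?pipeline=resume-with-person`. The looked-up properties end up under `person` (for example `person.name`) rather than at the top level of the document, so queries would need to target those paths.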
