Index PDF with Ingest Attachment Plugin using NodeJS Client

I am having a hard time understanding how I can index files such as PDF and RTF files using the Ingest Attachment Processor Plugin and make them searchable.

My main problem is that I can't find the files by searching after doing the following steps.

First I created a pipeline:

client.ingest.putPipeline({
  id: 'attachment',
  body: {
    description: 'Extract attachment information',
    processors: [
      {
        attachment: {
          field: 'data',
        },
      },
    ],
  },
});

Then I indexed my Lorem Ipsum .rtf file through that pipeline:

client.index({
  index: 'books',
  pipeline: 'attachment',
  body: {
    data: 'e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=',
  },
});
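
For anyone wondering where a string like the one in `data` comes from: the attachment processor expects the raw file bytes encoded as base64. A minimal sketch, assuming Node.js; `buildIndexRequest` is a hypothetical helper name, and the index and pipeline names are the ones used above:

```javascript
// Hypothetical helper: wraps raw file bytes in the request shape used above.
// The attachment processor reads base64 from the configured field ('data').
function buildIndexRequest(fileBytes) {
  return {
    index: 'books',
    pipeline: 'attachment',
    body: {
      data: Buffer.from(fileBytes).toString('base64'),
    },
  };
}

// The RTF source behind the base64 string in the post.
const rtf = '{\\rtf1\\ansi\r\nLorem ipsum dolor sit amet\r\n\\par }';
const request = buildIndexRequest(Buffer.from(rtf, 'latin1'));

console.log(request.body.data);
// For a file on disk (filePath is a placeholder):
//   buildIndexRequest(require('fs').readFileSync(filePath))
// and then: client.index(request)
```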

Lastly I searched for it:

client.search({
  index: 'books',
  body: {
    query: {
      match: { content: "Lorem ipsum" },
    },
  },
});

The problem is that the search returns no matches!

I checked whether the document was there by getting it by id, and it is indeed found, including the attachment content decoded from the base64 data. See the relevant part of the JSON response below.

"attachment": {
      "content_type": "application/rtf",
      "language": "ro",
      "content": "Lorem ipsum dolor sit amet",
      "content_length": 28
    }

Welcome!

Try:

client.search({
  index: 'books',
  body: {
    query: {
      match: { 'attachment.content': "Lorem ipsum" },
    },
  },
});
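
The reason, going by the document you got back by id: the processor nests its output under `attachment`, so there is no top-level `content` field to match against. A small illustration in plain JavaScript, where `resolveFieldPath` is a made-up helper that only mimics how a dotted field name maps onto the stored JSON (it is not how Elasticsearch looks fields up internally):

```javascript
// Made-up helper for illustration: walks a dotted path through a nested object.
function resolveFieldPath(source, path) {
  return path.split('.').reduce(
    (value, key) => (value == null ? undefined : value[key]),
    source
  );
}

// Relevant part of the indexed document, per the GET response in the question.
const source = {
  attachment: {
    content_type: 'application/rtf',
    language: 'ro',
    content: 'Lorem ipsum dolor sit amet',
    content_length: 28,
  },
};

console.log(resolveFieldPath(source, 'content'));            // undefined
console.log(resolveFieldPath(source, 'attachment.content')); // Lorem ipsum dolor sit amet
```

Note that the dotted name has to be quoted when it is used as an object key in JavaScript.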

Thank you! It works.

Quick follow-up question: am I doing this the intended and correct way? And does the attachment in

match: { 'attachment.content': "Lorem ipsum" },

refer to the name of my pipeline, which is also called attachment?

Yes. You are doing it correctly.

attachment refers to the field name in the document:

{
   // ...,
   "attachment": {
      "content_type": "application/rtf",
      "language": "ro",
      "content": "Lorem ipsum dolor sit amet",
      "content_length": 28
    }
}

This field and its inner fields are generated by the ingest attachment processor. You happened to put that processor in an ingest pipeline that is also named attachment, but that's a coincidence.

You could define your ingest pipeline under a different id, like this:

client.ingest.putPipeline({
  id: 'foo',
  body: {
    description: 'Extract attachment information',
    processors: [
      {
        attachment: {
          field: 'data',
        },
      },
    ],
  },
});

And use it:

client.index({
  index: 'books',
  pipeline: 'foo',
  body: {
    data: 'e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=',
  },
});

It will still generate the same data structure:

{
   // ...,
   "attachment": {
      "content_type": "application/rtf",
      "language": "ro",
      "content": "Lorem ipsum dolor sit amet",
      "content_length": 28
    }
}

You can change the default target field name using target_field. See "Using the Attachment Processor in a Pipeline" in the Elasticsearch Plugins and Integrations documentation (7.15).
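
A sketch of what that could look like, assuming the same client call style as above. The target name `file` and the helper names `buildPipeline` and `buildSearch` are just picked for illustration; `target_field` itself is the processor option from the documentation:

```javascript
// Illustrative helper: pipeline definition that writes the extracted
// fields under `targetField` instead of the default `attachment`.
function buildPipeline(targetField) {
  return {
    id: 'foo',
    body: {
      description: 'Extract attachment information',
      processors: [
        {
          attachment: {
            field: 'data',
            target_field: targetField,
          },
        },
      ],
    },
  };
}

// Illustrative helper: the search has to use the same prefix.
function buildSearch(targetField, text) {
  return {
    index: 'books',
    body: {
      query: {
        match: { [`${targetField}.content`]: text },
      },
    },
  };
}

const pipeline = buildPipeline('file');
const search = buildSearch('file', 'Lorem ipsum');
console.log(JSON.stringify(search.body.query));
// Usage with a client instance:
//   await client.ingest.putPipeline(pipeline);
//   await client.search(search);
```

With a pipeline like this, the extracted text would end up in `file.content` rather than `attachment.content`, so the query has to change along with it.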
