Index PDF with Ingest Attachment Plugin using NodeJS Client

BE-CH · November 10, 2021, 9:51am

I am having a hard time understanding how I can index files such as PDF and RTF files using the Ingest Attachment Processor Plugin and make them searchable.

My main problem seems to be that I can't search for the files after completing the following steps:

First I created a pipeline:

client.ingest.putPipeline({
  id: 'attachment',
  body: {
    description: 'Extract attachment information',
    processors: [
      {
        attachment: {
          field: 'data',
        },
      },
    ],
  },
});

Then I inserted my Lorem Ipsum .rtf file:

client.index({
  index: 'books',
  pipeline: 'attachment',
  body: {
    data: 'e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=',
  },
});
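
A note on the data value: the attachment processor expects the file's bytes as base64 text. A minimal sketch of producing it in Node.js follows; encodeForAttachment is a hypothetical helper name, and the RTF source is what the string above decodes to. For a real file, the bytes would typically come from fs.readFileSync(path) instead of a literal.

```javascript
// Hypothetical helper: turns raw file bytes (or a string) into the base64
// text that the attachment processor reads from its source field.
function encodeForAttachment(content) {
  return Buffer.from(content).toString('base64');
}

// The RTF source obtained by decoding the `data` value used in this thread.
const rtf = '{\\rtf1\\ansi\r\nLorem ipsum dolor sit amet\r\n\\par }';

const data = encodeForAttachment(rtf);
console.log(data);
```

The printed string should match the data value passed to client.index above.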

Lastly I searched for it:

client.search({
  index: 'books',
  body: {
    query: {
      match: { content: "Lorem ipsum" },
    },
  },
});

The problem is that the search returns no matches!

I tried checking whether the document was there by getting it by id, and it is indeed found, along with the attachment data decoded from the base64 data. See the JSON response below.

"attachment": {
      "content_type": "application/rtf",
      "language": "ro",
      "content": "Lorem ipsum dolor sit amet",
      "content_length": 28
    }

dadoonet · November 10, 2021, 10:32am

Welcome!

Try:

client.search({
  index: 'books',
  body: {
    query: {
      match: { 'attachment.content': "Lorem ipsum" },
    },
  },
});

BE-CH · November 10, 2021, 10:47am

Thank you! It works.

Quick follow-up question: am I doing it the intended and correct way? And does the attachment in

match: { 'attachment.content': "Lorem ipsum" },

refer to the name of my pipeline, which is called attachment?

dadoonet · November 10, 2021, 11:04am

Yes. You are doing it correctly.

attachment refers to the field name in the document:

{
   // ...,
   "attachment": {
      "content_type": "application/rtf",
      "language": "ro",
      "content": "Lorem ipsum dolor sit amet",
      "content_length": 28
    }
}

This field and its inner fields are generated by the ingest attachment processor. You put that processor in an ingest pipeline that is also named attachment, but that's a coincidence.

You could define your ingest pipeline like this:
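
To make the field naming concrete, here is a small sketch that lists the dotted paths of the stored document, which are the names a query has to use. The fieldPaths helper is hypothetical and not part of the client, and the example assumes the original data field is kept in the document.

```javascript
// Hypothetical helper: lists the dotted field paths of a document, i.e. the
// names a query must use for fields nested inside objects.
function fieldPaths(source, prefix = '') {
  return Object.entries(source).flatMap(([key, value]) => {
    const path = prefix ? `${prefix}.${key}` : key;
    const isObject =
      value !== null && typeof value === 'object' && !Array.isArray(value);
    return isObject ? fieldPaths(value, path) : [path];
  });
}

// The stored document from this thread (the base64 `data` value is replaced
// by a placeholder purely for readability).
const stored = {
  data: '<base64 of the file>',
  attachment: {
    content_type: 'application/rtf',
    language: 'ro',
    content: 'Lorem ipsum dolor sit amet',
    content_length: 28,
  },
};

const paths = fieldPaths(stored);
console.log(paths);
```

The list contains attachment.content but no top-level content, which is why the original match on content found nothing.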

client.ingest.putPipeline({
  id: 'foo',
  body: {
    description: 'Extract attachment information',
    processors: [
      {
        attachment: {
          field: 'data',
        },
      },
    ],
  },
});

And use it:

client.index({
  index: 'books',
  pipeline: 'foo',
  body: {
    data: 'e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=',
  },
});

It will still generate the same data structure:

{
   // ...,
   "attachment": {
      "content_type": "application/rtf",
      "language": "ro",
      "content": "Lorem ipsum dolor sit amet",
      "content_length": 28
    }
}

You can change the default target field name using target_field. See "Using the Attachment Processor in a Pipeline" in the Elasticsearch Plugins and Integrations [7.15] documentation.
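
As a rough sketch of what that could look like: the two builder functions and the field name file below are made up for illustration, and only target_field itself is an option of the attachment processor. Note that the search has to follow the new field name.

```javascript
// Hypothetical builder: a pipeline definition whose attachment processor
// writes its output to `targetField` instead of the default `attachment`.
function buildAttachmentPipeline(id, targetField) {
  return {
    id,
    body: {
      description: 'Extract attachment information',
      processors: [
        { attachment: { field: 'data', target_field: targetField } },
      ],
    },
  };
}

// Hypothetical builder: a match query against the extracted text, which now
// lives under `<targetField>.content`.
function buildContentQuery(index, targetField, text) {
  return {
    index,
    body: { query: { match: { [`${targetField}.content`]: text } } },
  };
}

const pipeline = buildAttachmentPipeline('foo', 'file');
const search = buildContentQuery('books', 'file', 'Lorem ipsum');

// With a connected client (assumed, not created here) these would be passed
// on as: client.ingest.putPipeline(pipeline) and client.search(search)
console.log(JSON.stringify(search.body.query));
```

With this pipeline the extracted text would be searchable as file.content rather than attachment.content.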

