Implementing Ingest Attachment Processor Plugin

kruelah · June 9, 2016, 8:35am

I currently use the Mapper Attachments type in my Elasticsearch 2.3 mapping and I am trying to migrate to Elasticsearch 5.0.0-beta3.
Unfortunately, the plugin has been replaced by the new Ingest Attachment Processor Plugin, which is not documented.
I read the documentation at https://www.elastic.co/guide/en/elasticsearch/plugins/master/ingest-attachment.html and installed the plugin, but my Elasticsearch instance returns an error message when I create my new index mapping.

put /fs-rfi
put /fs-rfi/document/_mapping
{
    "document" : {
        "properties" : {
            "file" : {
                "type" : "attachment",
                "fields" : {
                    "content" : {
                      "type" : "string",
                      "store" : true,
                      "term_vector" : "with_positions_offsets"
                    },
                    "title" : {"store" : "yes"},
                    "date" : {"store" : "yes"},
                    "author" : {"store" : "yes"},
                    "keywords" : {"store" : "yes"},
                    "content_type" : {"store" : "yes"},
                    "content_length" : {"store" : "yes"},
                    "language" : {"store" : "yes"}
                }
            },
            "author" : { "type" : "string", "store" : true, "fields" : { "raw": {"type" : "string", "index" : "not_analyzed"} } },
            "size" : { "type" : "integer", "store" : true },
            "format" : { "type" : "string", "store" : true, "fields" : { "raw": {"type" : "string", "index" : "not_analyzed"} } },
            "mimetype" : { "type" : "string", "store" : true },
            "unc" : { "type" : "string", "store" : true, "fields" : { "raw": {"type" : "string", "index" : "not_analyzed"} } },
            "keywords" : { "type" : "string", "store" : true },
            "language" : { "type" : "string", "store" : true },
            "name" : { "type" : "string", "store" : true, "fields" : { "raw": {"type" : "string", "index" : "not_analyzed"} } },
            "title" : { "type" : "string", "store" : true, "fields" : { "raw": {"type" : "string", "index" : "not_analyzed"} } },
            "lastupdate" : { "type" : "date", "format" : "yyyy-MM-dd HH:mm:ss.SSS", "store" : true },
        "id" : {
            "type" : "string"
          }
        }
    }
}

Return message:
{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "No handler for type [attachment] declared on field [file]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "No handler for type [attachment] declared on field [file]"
  },
  "status": 400
}

The GitHub repository is not documented either.

Any idea?

mvg · June 9, 2016, 8:52am

You have to remove the attachment field type from your mapping, because the field type is part of the attachment mapper plugin which you don't have installed. The attachment processor doesn't work with the mappings. It is part of the ingest framework that via pipelines alters the source before indexing.
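A minimal sketch of that first step, assuming the old mapping's "properties" section is available as a Python dict. The helper name strip_attachment_fields and the two sample fields are hypothetical; the idea is only to drop every field of type attachment before creating the index, since extraction is now done by the ingest pipeline rather than by the mapping.

```python
def strip_attachment_fields(properties):
    """Return a copy of a mapping's "properties" without attachment-typed fields."""
    cleaned = {}
    for name, spec in properties.items():
        if spec.get("type") == "attachment":
            # Extraction is handled by the ingest pipeline, not by the mapping.
            continue
        cleaned[name] = spec
    return cleaned


# Hypothetical excerpt of an old 2.x mapping.
old_properties = {
    "file": {"type": "attachment"},
    "size": {"type": "integer", "store": True},
}
new_properties = strip_attachment_fields(old_properties)
print(new_properties)
```

The remaining fields can then be sent in the PUT mapping request unchanged.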

dadoonet · June 9, 2016, 9:39am

Not technically exact.

Mapper attachment is still here but has been deprecated.
Doc is here.

It is also untrue that the new plugin is not documented. The documentation is here.

kruelah · June 13, 2016, 12:06pm

Thank you. I succeeded in making my old import process work using the deprecated Mapper attachment plugin.
I have to look into the new Attachment Processor plugin documentation since the Mapper attachment plugin is deprecated.
Do you know when the Mapper attachment plugin will disappear completely?

dadoonet · June 13, 2016, 1:10pm

I'd say probably 6.0, but it will depend on the discussion which might happen here.

evert · October 30, 2016, 2:44pm

Hello @dadoonet,

Can we still upload attachments using ingest while keeping the with_positions_offsets term vector setting we were used to, as in:

//Mapping...
'my_type' => [
    'file'      => [
        'type'      => 'attachment',
        'fields'    => [
            'content'   => [
                'type'          => 'string',
                'term_vector'   => 'with_positions_offsets',
                'store'         => true
            ]
        ]
    ],
]

So that we can retrieve the content results and highlight them?

In this case, I am using the PHP client.

Thanks!

dadoonet · October 30, 2016, 4:15pm

Yes, you can!

evert · October 30, 2016, 4:39pm

Nice!

@dadoonet, now I kind of agree with @kruelah... I could not find documentation for that. I am having a hard time: after 18 hours of work and research I still cannot get it to work without the old plugin. The documentation does not say how to map with the new ingest plugin. Could you shed some light on this?

Thanks!

dadoonet · October 30, 2016, 5:26pm

Did you read how ingest works? https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html

And the plugin doc as well? https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html

What did you do so far?
Where are you exactly blocked? What is unclear?

Maybe share a script of what you did so far?

In case it helps, here is a deck about Ingest. https://speakerdeck.com/elastic/ingest-node-re-indexer-et-enrichir-des-documents-dans-elasticsearch-softshake-2016

evert · October 31, 2016, 12:20am

Hi @dadoonet,

Thanks for your time and attention.

You were right: even after a lot of reading I was missing some important points. Here are the steps I walked through to get it done:

Install ingest-attachment:

         ./bin/elasticsearch-plugin install ingest-attachment

Create my pipeline:

     // POST to /_ingest/pipeline/attachment
     {
        "description" : "Extract attachment information",
        "processors" : [
            {
                "attachment" : {
                    "field" : "data"
                }
            }
        ]
     }

Map my index without my content field, which I called data in the previous step (the pipeline):

 // Using the PHP client, I map my index
     $this->params = [
         'index' => $this->index,
         'type'  => $this->type,
         'body'  => [
             $this->type => [
               'properties'    => [
                     'id' => [
                         'type' => 'integer'
                     ],
                     'name' => [
                         'type' => 'string'
                     ],
                     'description' => [
                         'type' => 'string'
                     ],
                     'type' => [
                         'type' => 'string'
                     ],
                     'author' => [
                         'type' => 'string'
                     ],
                     'editor' => [
                         'type' => 'string'
                     ]
               ]
             ]
         ]
     ];

Index some text to my index, without my pdf file, such as file name, type, author etc.

Then, we index the file, as shown below:

 // PUT /index/type/my_indexed_id?pipeline=attachment
 {
   "data": "base64_encode('file.pdf')"
 }

I got the file indexed, but I still could not search it. It seems the content is not being decoded when it reaches the Elasticsearch ingest node.

Could you give us a tip on this issue?
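For reference, a minimal Python sketch of how the value of the data field could be produced from a file's raw bytes. The helper name build_attachment_body is hypothetical and the sample bytes stand in for a real file; nothing here talks to Elasticsearch, it only builds the JSON body that would be sent with ?pipeline=attachment.

```python
import base64
import json


def build_attachment_body(file_bytes, field="data"):
    """Base64-encode raw bytes and wrap them in a JSON body for the pipeline field."""
    encoded = base64.b64encode(file_bytes).decode("ascii")
    return json.dumps({field: encoded})


# In practice the bytes would come from something like open("file.pdf", "rb").read();
# a short in-memory sample keeps this sketch self-contained.
sample_bytes = b"testing my first encoded text"
body = build_attachment_body(sample_bytes)
print(body)
```

The point of the sketch is that the field must carry the encoded bytes themselves, not the name of an encoding function or of the file.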

I think we are getting there!

Cheers!

dadoonet · October 31, 2016, 12:43am

Can you do a

 GET /index/type/my_indexed_id

evertramos · October 31, 2016, 10:16am

Yes I can, and it returns the encoded text. I will set up the environment here at work and update this answer so you can see the result. It returns _source with the encoded PDF content.

Thanks!

dadoonet · October 31, 2016, 11:48am

So please provide a full recreation script we can use to replay your problem.

See an example here: About the Elasticsearch category

evert · October 31, 2016, 2:06pm

Hi @dadoonet,

Here it goes; I hope it is as expected. If not, please let me know how I can improve it.

One detail: when creating my pipeline I had to use "indexed_chars" : -1 in order to index my PDF content.

After installing ingest-attachment, I create an index and a pipeline as in my last post, and index my first item as shown below:

    // Index 'First Book'
    {
        "field1" : "First Book"
    }

Then I index my PDF file content as shown below, setting the header Content-Type=application/pdf in Postman:

    // PUT /test/type1/1?pipeline=attachment
    {
       "data" : "MY_BASE_64_ENCODED_PDF_FILE"
    }

I have used both PHP encoding and ASP encoding.

This resulted in:

    {
      "_index": "test",
      "_type": "type1",
      "_id": "1",
      "_version": 2,
      "result": "updated",
      "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
      },
      "created": false
    }

So, I fetch my document, which shows:

    {
      "_index": "test",
      "_type": "type1",
      "_id": "1",
      "_version": 2,
      "found": true,
      "_source": {
        "data": "MY_BASE_64_ENCODED_PDF_FILE"
      }
    }

So, it does not match the documentation, where there should be something like this at the bottom:

    "attachment": {
      "content_type": "application/rtf",
      "language": "ro",
      "content": "Lorem ipsum dolor sit amet",
      "content_length": 28
    }

I am probably missing something, I am just not sure what or where. I also tried to index a simple base64-encoded text; it returns the "attachment" field, but empty, as shown below:

    {
      "_index": "test",
      "_type": "type1",
      "_id": "1",
      "_version": 2,
      "found": true,
      "_source": {
        "data": "dGVzdGluZyBteSBmaXJzdCBlbmNvZGVkIHRleHQ=",
        "attachment": {}
      }
    }

Thanks again for your help and your time.

Edit 1

I am using the Elasticsearch image on Docker.

dadoonet · October 31, 2016, 4:31pm

Here is a test which works on my machine:

DELETE /_ingest/pipeline/attachment
PUT /_ingest/pipeline/attachment
{
  "description": "my_pipeline",
  "processors": [
    {
      "attachment": {
        "field": "data"
      }
    }
  ]
}

POST /_ingest/pipeline/attachment/_simulate?pretty
{
  "docs": [ {
    "_index": "index",
    "_type": "type",
    "_id": "id",
    "_source": {
      "data": "dGVzdGluZyBteSBmaXJzdCBlbmNvZGVkIHRleHQ="
    }
  } ]
}

It gives:

{
  "docs": [
    {
      "doc": {
        "_index": "index",
        "_id": "id",
        "_type": "type",
        "_source": {
          "data": "dGVzdGluZyBteSBmaXJzdCBlbmNvZGVkIHRleHQ=",
          "attachment": {
            "content_type": "text/plain; charset=ISO-8859-1",
            "language": "et",
            "content": "testing my first encoded text",
            "content_length": 30
          }
        },
        "_ingest": {
          "timestamp": "2016-10-31T16:30:03.461+0000"
        }
      }
    }
  ]
}
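Before calling _simulate it can help to check locally that the payload really is valid base64 and decodes to what is expected. The sketch below does that for the string used in this test and builds the same _simulate body as a Python dict; the helper name build_simulate_body is hypothetical, and sending the request is left to whatever HTTP client is in use.

```python
import base64
import json


def build_simulate_body(encoded, field="data"):
    """Build a _simulate request body with one test document."""
    return {
        "docs": [
            {
                "_index": "index",
                "_type": "type",
                "_id": "id",
                "_source": {field: encoded},
            }
        ]
    }


encoded = "dGVzdGluZyBteSBmaXJzdCBlbmNvZGVkIHRleHQ="
# validate=True raises an error on characters outside the base64 alphabet.
decoded = base64.b64decode(encoded, validate=True).decode("ascii")
print(decoded)
print(json.dumps(build_simulate_body(encoded), indent=2))
```

If the decoded text is not what the file contains, the problem is on the client side rather than in the pipeline.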

evert · October 31, 2016, 6:30pm

The result of the POST to _simulate comes out in my environment as you showed, but when checking the document with GET /index/type/id, the output of the attachment processor does not come along.

So, when searching the data for a term/word, the text does not come back. Try running this in your example:

GET /index/_search?q=first

This is the result I am getting:

{
  "took": 17,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1.5306447,
    "hits": [
      {
        "_index": "test",
        "_type": "type1",
        "_id": "7",
        "_score": 1.5306447,
        "_source": {
          "docs": [
            {
              "_index": "test",
              "_type": "type1",
              "_id": "7",
              "_source": {
                "data": "JVBERi0xLjUNCiW1tbW1DQoxIDAgb2JqDQo8PC9UeXBlL0NhdGFsb2cvUGFnZXMgMiAwIFI"
              }
            }
          ]
        }
      },
      {
        "_index": "test",
        "_type": "type1",
        "_id": "1",
        "_score": 1.4201221,
        "_source": {
          "_content_type": "application/pdf",
          "data": "JVBERi0xLjMNJeLjz9MNCjUyNyAwIG9iag08PCANL0xpbmVhcml6ZWQgMSANL08gNTI5IA0",
          "attachment": {}
        }
      },
      {
        "_index": "test",
        "_type": "type1",
        "_id": "2",
        "_score": 1.4201221,
        "_source": {
          "data": "JVBERi0xLjMNJeLjz9MNCjUyNyAwIG9iag08PCANL0xpbmVhcml6ZWQgMSANL08gNTI5IA0",
          "attachment": {},
          "properties": [
            {
              "content_type": "application/pdf",
              "language": "pt-BR"
            }
          ]
        }
      }
    ]
  }
}

(I cut off part of the data because it's a PDF file and the encoded text is too big.)

You see, it is not returning the PDF content, only the encoded data.

Please let us know if the referenced search worked on your side.

evert · November 2, 2016, 7:46pm

Thanks @dadoonet

I got it working! The problem was that when you create a processor, you should specify an existing field of your index mapping.

So, if you have a data field in your mapping, you can use this field as your processor field.

If you allow me, I will update the documentation with this information to make it easier for others.

Thanks and best regards!

enolam · November 3, 2016, 6:18pm

Hey @dadoonet,

Since my issue is similar to what is being discussed here, I will post here. I can follow up to this point in the thread but I want to be able to search the attachment.content bit that has the plain text rendering of the base64 data. I also want to be able to highlight results with the content field as well.

The error I am getting when I try to index a document with a pipelined attachment:

{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse [attachment]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "failed to parse [attachment]",
    "caused_by": {
      "type": "illegal_state_exception",
      "reason": "Can't get text on a START_OBJECT at 1:15"
    }
  },
  "status": 400
}

My mapping:

{
	"mappings": {
		"document": {
			"properties": {
				"contents": {
					"type": "text"
				},
				"title": {
					"type": "text"
				},
				"location": {
					"type": "text"
				},
				"attachment": {
					"type": "text"
				}
			}
		}
	}
}

I am passing the base64 encoded binary via the contents field. My pipeline:

{
  "description": "Process documents",
  "processors": [
    {
      "attachment": {
        "field": "contents",
        "indexed_chars": -1
      }
    }
  ]
}

And finally the document I am passing via the PUT API:

{
	"contents": "dGVzdGluZyBteSBmaXJzdCBlbmNvZGVkIHRleHQ=", 
	"title": "testfile.docx", 
	"location": "righthere"
}

Let me know if any other data is needed. Thank you.

shanec · November 3, 2016, 6:48pm

You set up attachment as type text, but when the ingest-attachment processor outputs the attachment it comes as an object. You'll need to delete that mapping. The decoded message text will come in attachment.content
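As an alternative to deleting the mapping entirely, a sketch of a mapping that declares attachment as an object and keeps the term vector settings asked about earlier in the thread, plus a search body that highlights attachment.content. This is an assumption-laden illustration, not a tested recipe: the index layout mirrors the mapping posted above, the search term is arbitrary, and both dicts would still have to be sent with an HTTP client.

```python
import json

# "attachment" has no "type": it is an object whose sub-field "content" is text.
attachment_mapping = {
    "mappings": {
        "document": {
            "properties": {
                "contents": {"type": "text"},
                "title": {"type": "text"},
                "location": {"type": "text"},
                "attachment": {
                    "properties": {
                        "content": {
                            "type": "text",
                            "store": True,
                            "term_vector": "with_positions_offsets",
                        }
                    }
                },
            }
        }
    }
}

# Search the extracted text and ask for highlighted fragments of it.
search_body = {
    "query": {"match": {"attachment.content": "encoded"}},
    "highlight": {"fields": {"attachment.content": {}}},
}

print(json.dumps(attachment_mapping, indent=2))
print(json.dumps(search_body, indent=2))
```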

For what it's worth, you can use the _simulate feature of the ingest node to test things out in a really nice way. I'm going to combine the two examples with the following, which you can paste into the Dev Tools in Kibana

POST _ingest/pipeline/_simulate
{
  "pipeline":
  {
    "description" : "shanes pipeline",
    "processors" : [
        {
          "attachment": {
            "field": "message"
          }
        },
        {
          "set": {
            "field": "attachment.title",
            "value": "{{ title }}"
          }
        },
        {
          "remove": { "field": "message" }
        },
        {
          "remove": { "field": "title" }
        }
    ]
  },
  "docs": [
    {
      "_index": "shanetest",
      "_type": "shanetest",
      "_id": "1",
      "_source": {
        "message": "dGVzdGluZyBteSBmaXJzdCBlbmNvZGVkIHRleHQ=",
        "title": "testfile.docx",
        "location": "righthere"
      }
    }
  ]
}

Which produces

{
  "docs": [
    {
      "doc": {
        "_index": "shanetest",
        "_id": "1",
        "_type": "shanetest",
        "_source": {
          "attachment": {
            "content_type": "text/plain; charset=ISO-8859-1",
            "language": "et",
            "title": "testfile.docx",
            "content": "testing my first encoded text",
            "content_length": 30
          },
          "location": "righthere"
        },
        "_ingest": {
          "timestamp": "2016-11-03T18:42:40.881+0000"
        }
      }
    }
  ]
}

Then I can see that attachment is an object and that I've successfully moved title to attachment.title and then removed the original message field so I'm not storing the base64 content any more.
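To make the effect of the set and remove processors concrete, here is a rough local model in Python. It is not Elasticsearch code: the function apply_set_and_remove is hypothetical and simply mimics, on a plain dict, what those steps do to _source once the attachment processor has already filled in the attachment object.

```python
import copy


def apply_set_and_remove(source):
    """Mimic the set and remove steps on a copy of _source (local model only)."""
    doc = copy.deepcopy(source)
    # "set": copy the top-level title into attachment.title
    doc.setdefault("attachment", {})["title"] = doc["title"]
    # "remove": drop the base64 payload and the now-duplicated title
    del doc["message"]
    del doc["title"]
    return doc


# _source as it would look right after the attachment processor has run.
after_attachment = {
    "message": "dGVzdGluZyBteSBmaXJzdCBlbmNvZGVkIHRleHQ=",
    "title": "testfile.docx",
    "location": "righthere",
    "attachment": {
        "content_type": "text/plain; charset=ISO-8859-1",
        "language": "et",
        "content": "testing my first encoded text",
        "content_length": 30,
    },
}
result = apply_set_and_remove(after_attachment)
print(sorted(result))
```

Only attachment and location remain at the top level, which matches the simulated _source shown above.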

shradhatx · November 3, 2016, 7:32pm

I am trying a percolator query in 5.0 on attachments. I am getting this error: "Attachment fields are not searchable: [message]"
Any information on percolator with attachments would be helpful.
