Implementing Ingest Attachment Processor Plugin


#1

I currently use the Mapper Attachments type in my Elasticsearch 2.3 mapping and I am trying to migrate to Elasticsearch 5.0.0-beta3.
Unfortunately the plugin has been replaced by the new Ingest Attachment Processor Plugin, which is not documented.
I read the documentation at https://www.elastic.co/guide/en/elasticsearch/plugins/master/ingest-attachment.html and installed the plugin, but my Elasticsearch instance returns an error message when I create my new index mapping.

PUT /fs-rfi
PUT /fs-rfi/document/_mapping
{
    "document" : {
        "properties" : {
            "file" : {
                "type" : "attachment",
                "fields" : {
                    "content" : {
                      "type" : "string",
                      "store" : true,
                      "term_vector" : "with_positions_offsets"
                    },
                    "title" : {"store" : "yes"},
                    "date" : {"store" : "yes"},
                    "author" : {"store" : "yes"},
                    "keywords" : {"store" : "yes"},
                    "content_type" : {"store" : "yes"},
                    "content_length" : {"store" : "yes"},
                    "language" : {"store" : "yes"}
                }
            },
            "author" : { "type" : "string", "store" : true, "fields" : { "raw": {"type" : "string", "index" : "not_analyzed"} } },
            "size" : { "type" : "integer", "store" : true },
            "format" : { "type" : "string", "store" : true, "fields" : { "raw": {"type" : "string", "index" : "not_analyzed"} } },
            "mimetype" : { "type" : "string", "store" : true },
            "unc" : { "type" : "string", "store" : true, "fields" : { "raw": {"type" : "string", "index" : "not_analyzed"} } },
            "keywords" : { "type" : "string", "store" : true },
            "language" : { "type" : "string", "store" : true },
            "name" : { "type" : "string", "store" : true, "fields" : { "raw": {"type" : "string", "index" : "not_analyzed"} } },
            "title" : { "type" : "string", "store" : true, "fields" : { "raw": {"type" : "string", "index" : "not_analyzed"} } },
            "lastupdate" : { "type" : "date", "format" : "yyyy-MM-dd HH:mm:ss.SSS", "store" : true },
            "id" : {
                "type" : "string"
            }
        }
    }
}

Returned message:
{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "No handler for type [attachment] declared on field [file]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "No handler for type [attachment] declared on field [file]"
  },
  "status": 400
}

The GitHub repository is not documented either.

Any idea?


(Martijn Van Groningen) #2

You have to remove the attachment field type from your mapping, because the field type is part of the attachment mapper plugin which you don't have installed. The attachment processor doesn't work with the mappings. It is part of the ingest framework that via pipelines alters the source before indexing.
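
To make the difference concrete, here is a minimal Python sketch (not from the original posts) of the two request bodies involved: a pipeline definition that contains the attachment processor, and a mapping that no longer uses the `attachment` field type. The helper names, the `data` field and the fields in the sample mapping are illustrative assumptions. `indexed_chars` is an optional setting of the attachment processor, and -1 means no limit.

```python
import json


def build_attachment_pipeline(field, indexed_chars=None):
    """Return a body for PUT /_ingest/pipeline/<name>."""
    processor = {"field": field}
    if indexed_chars is not None:
        # -1 asks the processor to extract text without a character limit.
        processor["indexed_chars"] = indexed_chars
    return {
        "description": "Extract attachment information",
        "processors": [{"attachment": processor}],
    }


def build_plain_mapping():
    """Mapping without the `attachment` type (field names are hypothetical)."""
    return {
        "document": {
            "properties": {
                "title": {"type": "text"},
                "author": {"type": "keyword"},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_attachment_pipeline("data", indexed_chars=-1), indent=2))
    print(json.dumps(build_plain_mapping(), indent=2))
```

The pipeline is stored once and then referenced at index time with `?pipeline=<name>`, so the mapping itself never needs to mention the processor.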


(David Pilato) #3

Not technically exact.

Mapper attachment is still here but has been deprecated.
Doc is here.

This is untrue: the new plugin is documented. The documentation is here.


#4

Thank you. I succeeded in making my old import process work using the deprecated Mapper attachment plugin.
I have to look into the new Attachment Processor plugin documentation, since the Mapper attachment plugin is deprecated.
Do you know when the Mapper attachment plugin will completely disappear?


(David Pilato) #5

I'd say probably 6.0, but it will depend on the discussion that might happen here.


(evert) #6

Hello @dadoonet,

Can we still upload attachments using ingest and keep the positions/offsets term vector setting we were used to, as in:

//Mapping...
'my_type' => [
    'file'      => [
        'type'      => 'attachment',
        'fields'    => [
            'content'   => [
                'type'          => 'string',
                'term_vector'   => 'with_positions_offsets',
                'store'         => true
            ]
        ]
     ],
  ]

So that we can retrieve the content results and highlight them?

In this case, I am using the PHP client.

Thanks!


(David Pilato) #7

Yes you can!


(evert) #8

Nice!

@dadoonet, now I kind of agree with @kruelah. I could not find documentation for that. I have had a hard time, with 18 hours of work and research, and could not get it to work without the old plugin. The documentation does not say how to map with the new ingest plugin. Could you shed some light on this?

Thanks!


(David Pilato) #9

Did you read how ingest works? https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html

And the plugin doc as well? https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html

What did you do so far?
Where are you exactly blocked? What is unclear?

Maybe share a script of what you have done so far?

In case it helps, here is a deck about Ingest. https://speakerdeck.com/elastic/ingest-node-re-indexer-et-enrichir-des-documents-dans-elasticsearch-softshake-2016


(evert) #11

Hi @dadoonet,

Thanks for your time and attention.

You were right: even after a lot of reading I was missing some important points. Here are the steps I walked through to get it done:

  1. Install ingest-attachment:

             ./bin/elasticsearch-plugin install ingest-attachment
    
  2. Create my pipeline:

         // POST to /_ingest/pipeline/attachment
         {
            "description" : "Extract attachment information",
            "processors" : [
               {
                  "attachment" : {
                     "field" : "data"
                  }
               }
            ]
         }
    
  3. Map my index without my content field, which I called data in the previous step (pipeline):

     // Using the PHP client, I map my index
         $this->params = [
             'index' => $this->index,
             'type'  => $this->type,
             'body'  => [
                 $this->type => [
                   'properties'    => [
                         'id' => [
                             'type' => 'integer'
                         ],
                         'name' => [
                             'type' => 'string'
                         ],
                         'description' => [
                             'type' => 'string'
                         ],
                         'type' => [
                             'type' => 'string'
                         ],
                         'author' => [
                             'type' => 'string'
                         ],
                         'editor' => [
                             'type' => 'string'
                         ]
                    ]
                 ]
             ]
         ];
    
  4. Index some text to my index, without my PDF file: file name, type, author, etc.

  5. Then index the file, as below:

     // PUT /index/type/my_indexed_id?pipeline=attachment
     {
       "data": "base64_encode('file.pdf')"
     }
    
  6. The file got indexed, but I still could not search it. It seems it is not being decoded when it gets to the Elasticsearch ingest node.

Could you give us a tip on this issue?

I think we are getting there!

Cheers!
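
For step 5, note that the `data` field has to hold the Base64 encoding of the file's bytes, not of its file name. Here is a minimal Python sketch of building that request body, offered as an illustration rather than as the code used in this post. The function names and the `file.pdf` path are hypothetical, and sending the request is left to whichever HTTP client is in use.

```python
import base64
from pathlib import Path


def encode_bytes(raw):
    """Base64-encode raw bytes and return the result as an ASCII string."""
    return base64.b64encode(raw).decode("ascii")


def build_document(path, field="data"):
    """Read a file and wrap its Base64-encoded content in a document body."""
    return {field: encode_bytes(Path(path).read_bytes())}


if __name__ == "__main__":
    # Encode a short text so the payload is easy to inspect by eye.
    print(encode_bytes(b"testing my first encoded text"))
    # Hypothetical usage for a real file:
    #   body = build_document("file.pdf")
    #   serialize body as JSON, then
    #   PUT /index/type/my_indexed_id?pipeline=attachment
```

Since the request body is a JSON document, the request would normally be sent as JSON regardless of the type of the file that was encoded.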


(David Pilato) #12

Can you do a

 GET /index/type/my_indexed_id

#13

Yes I can, and it returns the encoded text. I will set up the environment here at work and update this answer so you can see the result. It returns _source with the encoded PDF content.

Thanks!


(David Pilato) #14

So please provide a full recreation script we can use to replay your problem.

See an example here: About the Elasticsearch category


(evert) #16

Hi @dadoonet,

Here it goes; I hope it is as expected. If not, please let me know how I can improve it.

One detail: when creating my pipeline I had to use "indexed_chars" : -1 in order to index my PDF content.

After installing ingest-attachment, I create an index and a pipeline as in my last post, and index my first item as below:

    // Index 'First Book'
    {
        "field1" : "First Book"
    }

Then I index my PDF file content as below, setting the header Content-Type=application/pdf in Postman:

    // PUT /test/type1/1?pipeline=attachment
    {
       "data" : "MY_BASE_64_ENCODED_PDF_FILE"
    }

I have tried both PHP and ASP encoding.

This resulted in:

    {
      "_index": "test",
      "_type": "type1",
      "_id": "1",
      "_version": 2,
      "result": "updated",
      "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
      },
      "created": false
    }

So I fetch the document, which shows:

    {
      "_index": "test",
      "_type": "type1",
      "_id": "1",
      "_version": 2,
      "found": true,
      "_source": {
        "data": "MY_BASE_64_ENCODED_PDF_FILE"
      }
    }

So it does not look like the documentation, where there should be something like this at the bottom:

    "attachment": {
          "content_type": "application/rtf",
          "language": "ro",
          "content": "Lorem ipsum dolor sit amet",
          "content_length": 28
        }

I am probably missing something, I am just not sure what or where. I also tried to index a simple Base64-encoded text; it returns the "attachment" field, but empty, as below:

    {
      "_index": "test",
      "_type": "type1",
      "_id": "1",
      "_version": 2,
      "found": true,
      "_source": {
        "data": "dGVzdGluZyBteSBmaXJzdCBlbmNvZGVkIHRleHQ=",
        "attachment": {}
      }
    }

Thanks again for your help and your time.

Edit 1

I am using the elasticsearch image on Docker.


(David Pilato) #17

Here is a test which works on my machine:

DELETE /_ingest/pipeline/attachment
PUT /_ingest/pipeline/attachment
{
  "description": "my_pipeline",
  "processors": [
    {
      "attachment": {
        "field": "data"
      }
    }
  ]
}

POST /_ingest/pipeline/attachment/_simulate?pretty
{
  "docs": [ {
    "_index": "index",
    "_type": "type",
    "_id": "id",
    "_source": {
      "data": "dGVzdGluZyBteSBmaXJzdCBlbmNvZGVkIHRleHQ="
    }
  } ]
}

It gives:

{
  "docs": [
    {
      "doc": {
        "_index": "index",
        "_id": "id",
        "_type": "type",
        "_source": {
          "data": "dGVzdGluZyBteSBmaXJzdCBlbmNvZGVkIHRleHQ=",
          "attachment": {
            "content_type": "text/plain; charset=ISO-8859-1",
            "language": "et",
            "content": "testing my first encoded text",
            "content_length": 30
          }
        },
        "_ingest": {
          "timestamp": "2016-10-31T16:30:03.461+0000"
        }
      }
    }
  ]
}
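
The same check can be scripted. The following Python sketch (an illustration, not part of the test above) builds the `_simulate` request body for a Base64 payload and pulls `attachment.content` out of a response shaped like the one shown. The function names are hypothetical, and the sample response is an abbreviated copy of the output above.

```python
import base64


def build_simulate_body(raw, field="data"):
    """Body for POST /_ingest/pipeline/<name>/_simulate with one test document."""
    encoded = base64.b64encode(raw).decode("ascii")
    return {
        "docs": [
            {"_index": "index", "_type": "type", "_id": "id",
             "_source": {field: encoded}}
        ]
    }


def extracted_contents(response):
    """Return attachment.content for each simulated doc, or None if it is missing."""
    contents = []
    for entry in response.get("docs", []):
        source = entry.get("doc", {}).get("_source", {})
        contents.append(source.get("attachment", {}).get("content"))
    return contents


# Abbreviated copy of the simulate response shown above.
SAMPLE_RESPONSE = {
    "docs": [
        {"doc": {"_source": {
            "data": "dGVzdGluZyBteSBmaXJzdCBlbmNvZGVkIHRleHQ=",
            "attachment": {
                "content": "testing my first encoded text",
                "content_length": 30,
            },
        }}}
    ]
}

if __name__ == "__main__":
    print(build_simulate_body(b"testing my first encoded text"))
    print(extracted_contents(SAMPLE_RESPONSE))
```

A `None` entry in the returned list points at a document for which the processor produced no `content`, which is a quick way to spot an empty `attachment` object.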

(evert) #18

The result of the POST to _simulate comes out in my environment as you showed, but when checking the index with GET /index/type/id, the output of the attachment processor is not there.

So when searching for a term/word, the text does not come back. Try running this in your example:

GET /index/_search?q=first

This is the result I am getting:

{
  "took": 17,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1.5306447,
    "hits": [
      {
        "_index": "test",
        "_type": "type1",
        "_id": "7",
        "_score": 1.5306447,
        "_source": {
          "docs": [
            {
              "_index": "test",
              "_type": "type1",
              "_id": "7",
              "_source": {
                "data": "JVBERi0xLjUNCiW1tbW1DQoxIDAgb2JqDQo8PC9UeXBlL0NhdGFsb2cvUGFnZXMgMiAwIFI"
              }
            }
          ]
        }
      },
      {
        "_index": "test",
        "_type": "type1",
        "_id": "1",
        "_score": 1.4201221,
        "_source": {
          "_content_type": "application/pdf",
          "data": "JVBERi0xLjMNJeLjz9MNCjUyNyAwIG9iag08PCANL0xpbmVhcml6ZWQgMSANL08gNTI5IA0",
          "attachment": {}
        }
      },
      {
        "_index": "test",
        "_type": "type1",
        "_id": "2",
        "_score": 1.4201221,
        "_source": {
          "data": "JVBERi0xLjMNJeLjz9MNCjUyNyAwIG9iag08PCANL0xpbmVhcml6ZWQgMSANL08gNTI5IA0",
          "attachment": {},
          "properties": [
            {
              "content_type": "application/pdf",
              "language": "pt-BR"
            }
          ]
        }
      }
    ]
  }
}

(I cut off part of the data because it is a PDF file and the encoded text is too big.)

You see, it does not return the PDF content, only the encoded data.

Please let us know whether the referenced search worked on your side.


(evert) #19

Thanks @dadoonet

I got it working! The problem was that when you create a processor, you should specify an existing field of your index mapping.

So, if you have a data field in your mapping, you can use this field as your processor field.

If you allow me, I will update the documentation with this information to make it easier for others.

Thanks and best regards!


(enolam) #21

Hey @dadoonet,

Since my issue is similar to what is being discussed here, I will post here. I can follow the thread up to this point, but I want to be able to search the attachment.content field, which holds the plain-text rendering of the Base64 data. I also want to be able to highlight results from the content field.

The error I am getting when I try to index a document with a pipelined attachment:

{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse [attachment]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "failed to parse [attachment]",
    "caused_by": {
      "type": "illegal_state_exception",
      "reason": "Can't get text on a START_OBJECT at 1:15"
    }
  },
  "status": 400
}

My mapping:

{
	"mappings": {
		"document": {
			"properties": {
				"contents": {
					"type": "text"
				},
				"title": {
					"type": "text"
				},
				"location": {
					"type": "text"
				},
				"attachment": {
					"type": "text"
				}
			}
		}
	}
}

I am passing the base64 encoded binary via the contents field. My pipeline:

{
  "description": "Process documents",
  "processors": [
    {
      "attachment": {
        "field": "contents",
        "indexed_chars": -1
      }
    }
  ]
}

And finally the document I am passing via the PUT API:

{
	"contents": "dGVzdGluZyBteSBmaXJzdCBlbmNvZGVkIHRleHQ=", 
	"title": "testfile.docx", 
	"location": "righthere"
}

Let me know if any other data is needed. Thank you.


(Shane Connelly) #22

You set up attachment as type text, but the ingest-attachment processor outputs attachment as an object. You'll need to delete that mapping. The decoded message text will be in attachment.content.

For what it's worth, you can use the _simulate feature of the ingest node to test things out in a really nice way. I'm going to combine the two examples into the following, which you can paste into the Dev Tools in Kibana:

POST _ingest/pipeline/_simulate
{
  "pipeline":
  {
    "description" : "shanes pipeline",
    "processors" : [
        {
          "attachment": {
            "field": "message"
          }
        },
        {
          "set": {
            "field": "attachment.title",
            "value": "{{ title }}"
          }
        },
        {
          "remove": { "field": "message" }
        },
        {
          "remove": { "field": "title" }
        }
    ]
  },
  "docs": [
    {
      "_index": "shanetest",
      "_type": "shanetest",
      "_id": "1",
      "_source": {
        "message": "dGVzdGluZyBteSBmaXJzdCBlbmNvZGVkIHRleHQ=",
        "title": "testfile.docx",
        "location": "righthere"
      }
    }
  ]
}

Which produces

{
  "docs": [
    {
      "doc": {
        "_index": "shanetest",
        "_id": "1",
        "_type": "shanetest",
        "_source": {
          "attachment": {
            "content_type": "text/plain; charset=ISO-8859-1",
            "language": "et",
            "title": "testfile.docx",
            "content": "testing my first encoded text",
            "content_length": 30
          },
          "location": "righthere"
        },
        "_ingest": {
          "timestamp": "2016-11-03T18:42:40.881+0000"
        }
      }
    }
  ]
}

Then I can see that attachment is an object and that I've successfully moved title to attachment.title and then removed the original message field so I'm not storing the base64 content any more.
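
For the search and highlighting part of the question, one option is to map `attachment` as an object and put the highlighting-related settings on `attachment.content`. The Python sketch below only builds the request bodies. The `document` type name comes from the mapping in the question, and the `term_vector` and `store` settings are carried over from the old mapper-attachments mapping earlier in the thread. Whether they are needed depends on the highlighter in use, so treat this as an assumption to verify rather than a confirmed setup.

```python
import json


def build_mapping(doc_type="document"):
    """Index body where `attachment` is an object, not a text field."""
    return {
        "mappings": {
            doc_type: {
                "properties": {
                    "title": {"type": "text"},
                    "location": {"type": "text"},
                    "attachment": {
                        "properties": {
                            "content": {
                                "type": "text",
                                "store": True,
                                "term_vector": "with_positions_offsets",
                            }
                        }
                    },
                }
            }
        }
    }


def build_highlight_query(term):
    """Search body that matches on the extracted text and asks for highlights."""
    return {
        "query": {"match": {"attachment.content": term}},
        "highlight": {"fields": {"attachment.content": {}}},
    }


if __name__ == "__main__":
    print(json.dumps(build_mapping(), indent=2))
    print(json.dumps(build_highlight_query("first"), indent=2))
```

The first body would be sent when creating the index and the second to `_search`. Leaving `attachment` out of the mapping entirely should also work, since dynamic mapping will create it as an object, but then the term vector settings are not applied.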


(Shradha Bhalla) #23

I am trying a percolator query in 5.0 on attachments. I am getting this error: "Attachment fields are not searchable: [message]"
Any information on percolator with attachments would be helpful.