Implementing Ingest Attachment Processor Plugin


(evert) #8

Nice!

@dadoonet, now I kind of agree with @kruelah... I could not find documentation for that. I have had a hard time, with 18 hours of work and research, and still cannot get it to work without the old plugin. The documentation does not say how to map with the new ingest plugin. Could you shed some light on this?

Thanks!


(David Pilato) #9

Did you read how ingest works? https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html

And the plugin doc as well? https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html

What did you do so far?
Where are you exactly blocked? What is unclear?

Maybe share a script of what you did so far?

In case it helps, here is a deck about Ingest. https://speakerdeck.com/elastic/ingest-node-re-indexer-et-enrichir-des-documents-dans-elasticsearch-softshake-2016


(evert) #11

Hi @dadoonet,

Thanks for your time and attention.

You were right: even after a lot of reading I was missing some important points. Here are the steps I walked through to get it done:

  1. Install ingest-attachment:

             ./bin/elasticsearch-plugin install ingest-attachment
    
  2. Create my pipeline:

         // PUT /_ingest/pipeline/attachment
         {
           "description" : "Extract attachment information",
           "processors" : [
             {
               "attachment" : {
                 "field" : "data"
               }
             }
           ]
         }
    
  3. Map my index without my content field, which I called data in the previous step (pipeline):

         // Using the PHP client, I map my index
         $this->params = [
             'index' => $this->index,
             'type'  => $this->type,
             'body'  => [
                 $this->type => [
                     'properties' => [
                         'id' => [
                             'type' => 'integer'
                         ],
                         'name' => [
                             'type' => 'string'
                         ],
                         'description' => [
                             'type' => 'string'
                         ],
                         'type' => [
                             'type' => 'string'
                         ],
                         'author' => [
                             'type' => 'string'
                         ],
                         'editor' => [
                             'type' => 'string'
                         ]
                     ]
                 ]
             ]
         ];
    
  4. Index some text to my index, without my pdf file, such as file name, type, author etc.

  5. Then index the file, as below:

     // PUT /index/type/my_indexed_id?pipeline=attachment
     {
       "data": "base64_encode('file.pdf')"
     }
    
  6. The file got indexed, but I still could not search it. It seems it is not being decoded when it gets to Elasticsearch ingest.
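
One thing worth checking in step 5: in PHP, `base64_encode('file.pdf')` encodes the literal string `file.pdf` rather than the file's contents, and inside a JSON string it is not evaluated at all. Below is a minimal Python sketch of what the `data` field is expected to hold, assuming the pipeline reads from `data` as in step 2. The helper name `build_attachment_doc` and the stand-in file are hypothetical.

```python
import base64
import json
import os
import tempfile


def build_attachment_doc(path, field="data"):
    """Return a JSON-serializable document whose `field` holds the
    base64-encoded bytes of the file at `path`."""
    with open(path, "rb") as handle:
        raw = handle.read()
    # b64encode returns bytes; decode to str so it can be placed in JSON.
    return {field: base64.b64encode(raw).decode("ascii")}


if __name__ == "__main__":
    # Stand-in file for illustration; a real PDF is read the same way.
    with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
        tmp.write(b"testing my first encoded text")
    try:
        print(json.dumps(build_attachment_doc(tmp.name)))
    finally:
        os.remove(tmp.name)
```

The resulting JSON would be the body of the `PUT ...?pipeline=attachment` request from step 5.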

Could you give us some tip on this issue?

I think we are getting there!

Cheers!


(David Pilato) #12

Can you do a

 GET /index/type/my_indexed_id

#13

Yes I can, and it returns the encoded text. I will set up the environment here at work and update this answer so you can see the result. It returns _source with the encoded PDF content.

Thanks!


(David Pilato) #14

So please provide a full recreation script we can use to replay your problem.

See an example here: About the Elasticsearch category


(evert) #16

Hi @dadoonet,

Here it goes, hope it is as expected. If not, please let me know how I can improve it.

One detail: when creating my pipeline I had to use "indexed_chars" : -1 in order to index my PDF content.
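
For reference, here is a minimal sketch of a pipeline definition with that setting, built as a Python dict. By default the attachment processor extracts at most 100000 characters, and `-1` removes the limit. The helper name `build_attachment_pipeline` is hypothetical; the resulting JSON would be the body of `PUT /_ingest/pipeline/attachment`.

```python
import json


def build_attachment_pipeline(field="data", indexed_chars=None):
    """Build the request body for PUT /_ingest/pipeline/<name>."""
    attachment = {"field": field}
    if indexed_chars is not None:
        # -1 disables the extraction limit.
        attachment["indexed_chars"] = indexed_chars
    return {
        "description": "Extract attachment information",
        "processors": [{"attachment": attachment}],
    }


if __name__ == "__main__":
    print(json.dumps(build_attachment_pipeline(indexed_chars=-1), indent=2))
```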

After installing ingest-attachment, I create an index and a pipeline as in my last post, and index my first item as below:

    // Index 'First Book'
    {
        "field1" : "First Book"
    }

Then I index my PDF file content, as below, with the header Content-Type=application/pdf set in Postman:

    // PUT /test/type1/1?pipeline=attachment
    {
       "data" : "MY_BASE_64_ENCODED_PDF_FILE"
    }

I have used PHP encoding and ASP encoding.

Which resulted in:

    {
      "_index": "test",
      "_type": "type1",
      "_id": "1",
      "_version": 2,
      "result": "updated",
      "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
      },
      "created": false
    }

So, I fetch my index, which shows:

    {
      "_index": "test",
      "_type": "type1",
      "_id": "1",
      "_version": 2,
      "found": true,
      "_source": {
        "data": "MY_BASE_64_ENCODED_PDF_FILE"
      }
    }

So, it does not look like the documentation, which should have something like this at the bottom:

    "attachment": {
          "content_type": "application/rtf",
          "language": "ro",
          "content": "Lorem ipsum dolor sit amet",
          "content_length": 28
        }

I am probably missing something, just not sure what or where. Also, I tried to index a simple base64-encoded text; it returns the "attachment" field, but empty, as below:

    {
      "_index": "test",
      "_type": "type1",
      "_id": "1",
      "_version": 2,
      "found": true,
      "_source": {
        "data": "dGVzdGluZyBteSBmaXJzdCBlbmNvZGVkIHRleHQ=",
        "attachment": {}
      }
    }

Thanks again for your help and your time.

Edit 1

I am using elasticsearch image on docker.


(David Pilato) #17

Here is a test which works on my machine:

DELETE /_ingest/pipeline/attachment
PUT /_ingest/pipeline/attachment
{
  "description": "my_pipeline",
  "processors": [
    {
      "attachment": {
        "field": "data"
      }
    }
  ]
}

POST /_ingest/pipeline/attachment/_simulate?pretty
{
  "docs": [ {
    "_index": "index",
    "_type": "type",
    "_id": "id",
    "_source": {
      "data": "dGVzdGluZyBteSBmaXJzdCBlbmNvZGVkIHRleHQ="
    }
  } ]
}

It gives:

{
  "docs": [
    {
      "doc": {
        "_index": "index",
        "_id": "id",
        "_type": "type",
        "_source": {
          "data": "dGVzdGluZyBteSBmaXJzdCBlbmNvZGVkIHRleHQ=",
          "attachment": {
            "content_type": "text/plain; charset=ISO-8859-1",
            "language": "et",
            "content": "testing my first encoded text",
            "content_length": 30
          }
        },
        "_ingest": {
          "timestamp": "2016-10-31T16:30:03.461+0000"
        }
      }
    }
  ]
}
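
As a quick sanity check outside Elasticsearch, the sample value can be decoded locally to confirm that it is valid base64 and see what text it carries. This is only a sketch using Python's standard library; it does not involve the plugin.

```python
import base64

SAMPLE = "dGVzdGluZyBteSBmaXJzdCBlbmNvZGVkIHRleHQ="

# validate=True raises binascii.Error if the input contains characters
# outside the base64 alphabet, instead of silently skipping them.
decoded = base64.b64decode(SAMPLE, validate=True).decode("ascii")
print(decoded)
```

If the same check fails for the value you send in `data`, the problem is in the encoding step rather than in the pipeline.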

(evert) #18

The result of the POST to _simulate comes out as you showed in my environment too, but when checking the index with GET /index/type/id, the output of the attachment processor is not there.

So, when searching for a term/word, the text does not come back. Try running this against your example:

GET /index/_search?q=first

This is the result I am getting:

{
  "took": 17,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1.5306447,
    "hits": [
      {
        "_index": "test",
        "_type": "type1",
        "_id": "7",
        "_score": 1.5306447,
        "_source": {
          "docs": [
            {
              "_index": "test",
              "_type": "type1",
              "_id": "7",
              "_source": {
                "data": "JVBERi0xLjUNCiW1tbW1DQoxIDAgb2JqDQo8PC9UeXBlL0NhdGFsb2cvUGFnZXMgMiAwIFI"
              }
            }
          ]
        }
      },
      {
        "_index": "test",
        "_type": "type1",
        "_id": "1",
        "_score": 1.4201221,
        "_source": {
          "_content_type": "application/pdf",
          "data": "JVBERi0xLjMNJeLjz9MNCjUyNyAwIG9iag08PCANL0xpbmVhcml6ZWQgMSANL08gNTI5IA0",
          "attachment": {}
        }
      },
      {
        "_index": "test",
        "_type": "type1",
        "_id": "2",
        "_score": 1.4201221,
        "_source": {
          "data": "JVBERi0xLjMNJeLjz9MNCjUyNyAwIG9iag08PCANL0xpbmVhcml6ZWQgMSANL08gNTI5IA0",
          "attachment": {},
          "properties": [
            {
              "content_type": "application/pdf",
              "language": "pt-BR"
            }
          ]
        }
      }
    ]
  }
}

(I cut off part of the data because it's a PDF file and the encoded text is too big.)

You see, it is not returning the PDF content, only the encoded data.

Please let us know if the referenced search worked on your side.


(evert) #19

Thanks @dadoonet

I got it working! The problem was that when you create a processor, you should specify an existing field from your index mapping.

So, if you have a data field in your mapping, you can use this field as your processor field.

If you allow me, I will update the documentation with this information to make it easier for others.

Thanks and best regards!


(enolam) #21

Hey @dadoonet,

Since my issue is similar to what is being discussed here, I will post here. I can follow up to this point in the thread but I want to be able to search the attachment.content bit that has the plain text rendering of the base64 data. I also want to be able to highlight results with the content field as well.

The error I am getting when I try to index a document with a pipelined attachment:

{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse [attachment]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "failed to parse [attachment]",
    "caused_by": {
      "type": "illegal_state_exception",
      "reason": "Can't get text on a START_OBJECT at 1:15"
    }
  },
  "status": 400
}

My mapping:

{
	"mappings": {
		"document": {
			"properties": {
				"contents": {
					"type": "text"
				},
				"title": {
					"type": "text"
				},
				"location": {
					"type": "text"
				},
				"attachment": {
					"type": "text"
				}
			}
		}
	}
}

I am passing the base64 encoded binary via the contents field. My pipeline:

{
  "description": "Process documents",
  "processors": [
    {
      "attachment": {
        "field": "contents",
        "indexed_chars": -1
      }
    }
  ]
}

And finally the document I am passing via the put api.

{
	"contents": "dGVzdGluZyBteSBmaXJzdCBlbmNvZGVkIHRleHQ=", 
	"title": "testfile.docx", 
	"location": "righthere"
}

Let me know if any other data is needed. Thank you.


(Shane Connelly) #22

You set up attachment as type text, but the ingest-attachment processor outputs attachment as an object. You'll need to delete that mapping. The decoded message text will come in attachment.content.
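
To illustrate the conflict, here is a small sketch that checks whether a mapping declares the processor's target field (`attachment` by default) as a leaf type instead of an object. The function name and the check itself are illustrative assumptions, not part of any Elasticsearch client.

```python
def target_field_conflicts(properties, target_field="attachment"):
    """Return True if `target_field` is mapped as a non-object type.

    `properties` is the "properties" section of a type mapping.
    """
    definition = properties.get(target_field)
    if definition is None:
        # Not mapped explicitly, so nothing in the mapping can clash.
        return False
    # Object fields either omit "type" or declare "object"/"nested";
    # anything else (text, keyword, ...) cannot hold the processor's output.
    return definition.get("type", "object") not in ("object", "nested")


if __name__ == "__main__":
    properties = {
        "contents": {"type": "text"},
        "title": {"type": "text"},
        "location": {"type": "text"},
        "attachment": {"type": "text"},
    }
    print(target_field_conflicts(properties))
```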

For what it's worth, you can use the _simulate feature of the ingest node to test things out in a really nice way. I'm going to combine the two examples with the following, which you can paste into the Dev Tools in Kibana

POST _ingest/pipeline/_simulate
{
  "pipeline":
  {
    "description" : "shanes pipeline",
    "processors" : [
        {
          "attachment": {
            "field": "message"
          }
        },
        {
          "set": {
            "field": "attachment.title",
            "value": "{{ title }}"
          }
        },
        {
          "remove": { "field": "message" }
        },
        {
          "remove": { "field": "title" }
        }
    ]
  },
  "docs": [
    {
      "_index": "shanetest",
      "_type": "shanetest",
      "_id": "1",
      "_source": {
        "message": "dGVzdGluZyBteSBmaXJzdCBlbmNvZGVkIHRleHQ=",
        "title": "testfile.docx",
        "location": "righthere"
      }
    }
  ]
}

Which produces

{
  "docs": [
    {
      "doc": {
        "_index": "shanetest",
        "_id": "1",
        "_type": "shanetest",
        "_source": {
          "attachment": {
            "content_type": "text/plain; charset=ISO-8859-1",
            "language": "et",
            "title": "testfile.docx",
            "content": "testing my first encoded text",
            "content_length": 30
          },
          "location": "righthere"
        },
        "_ingest": {
          "timestamp": "2016-11-03T18:42:40.881+0000"
        }
      }
    }
  ]
}

Then I can see that attachment is an object and that I've successfully moved title to attachment.title and then removed the original message field so I'm not storing the base64 content any more.


(Shradha Bhalla) #23

I am trying percolator query in 5.0 on attachments. I am getting this ERROR: "Attachment fields are not searchable: [message]"
Any information on percolator with attachments would be helpful.


(enolam) #24

@shradhatx I am with you. I am currently exploring Kibana as mentioned above by @shanec. The issue I am having now is querying the content field in the attachment. I also want to return highlights from the field as well. But since _source is not searchable I don't know how to properly set up my query.


(enolam) #25

So, my issue has been solved. Thanks to @shanec for the help and mentioning Kibana.

For the sake of brevity, my working solution is below. I'm sure this can be optimized, but for now it gets the job done.

DELETE /myindex
PUT /myindex
{
	"mappings": {
		"document": {
			"properties": {
				"thedata": {
					"type": "text"
				},
				"title": {
					"type": "text"
				},
				"location": {
					"type": "text"
				}
			}
		}
	}
}

DELETE _ingest/pipeline/attachment
PUT _ingest/pipeline/attachment
{
  "description": "Process documents",
  "processors": [
    {
      "attachment": {
        "field": "thedata",
        "indexed_chars": -1
      }
    },
    {
      "set": {
        "field": "attachment.title",
        "value": "{{ title }}"
      }
    },
    {
      "set": {
        "field": "attachment.location",
        "value": "{{ location }}"
      }
    },
    {
      "remove": { "field": "thedata" }
    },
    {
      "remove": { "field": "title" }
    },
    {
      "remove": { "field": "location" }
    }
  ]
}

PUT /_bulk?pipeline=attachment
{"index": {"_index": "myindex", "_type" : "document", "_id" : "2" }}
{"thedata": "dGVzdGluZyBteSBmaXJzdCBlbmNvZGVkIHRleHQ=", "title": "testfile.docx", "location": "righthere"}


GET /myindex/document/2

GET /myindex/_search
{
  "query": {
    "match": {
      "attachment.content": "testing"
    }
  },
  "highlight": {
    "fields": {
      "attachment.content": {
        "fragment_size": 150,
        "number_of_fragments": 3,
        "no_match_size": 150
      }
    }
  }
}
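
The pipeline above repeats the same `set`/`remove` pattern for each field. Here is a sketch that generates such a definition for any list of fields, assuming the same field names as above. The helper `build_move_pipeline` is hypothetical.

```python
import json


def build_move_pipeline(source_field, fields_to_move, target="attachment"):
    """Build a pipeline body that extracts `source_field`, copies each
    field in `fields_to_move` under `target`, then removes the originals."""
    processors = [{"attachment": {"field": source_field, "indexed_chars": -1}}]
    for name in fields_to_move:
        # "{{ name }}" is a template that resolves to the field's value.
        processors.append(
            {"set": {"field": f"{target}.{name}", "value": "{{ " + name + " }}"}}
        )
    for name in [source_field, *fields_to_move]:
        processors.append({"remove": {"field": name}})
    return {"description": "Process documents", "processors": processors}


if __name__ == "__main__":
    body = build_move_pipeline("thedata", ["title", "location"])
    print(json.dumps(body, indent=2))
```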

(Shradha Bhalla) #26

Or alternately you could do this:

DELETE /myindex
PUT /myindex
PUT /myindex/mytype/_mapping
{
  "mytype": {
    "properties": {
      "attachment": {
        "properties": {
          "content": {
            "type": "text",
            "term_vector": "with_positions_offsets",
            "store": true
          }
        }
      }
    }
  }
}

PUT myindex/mytype/1?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}

GET /myindex/mytype/_search
{
  "stored_fields": [],
  "query": {
    "match": { "attachment.content": "ipsum" }
  },
  "highlight": {
    "fields": {
      "attachment.content": {}
    }
  }
}
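
Both variants end with the same kind of request: a `match` query on `attachment.content` plus a `highlight` section for that field. Here is a minimal sketch of building that body in Python. The helper name is hypothetical, and highlight options are passed through unchanged.

```python
import json


def build_content_search(term, field="attachment.content", **highlight_options):
    """Build a search body that matches `term` in `field` and highlights it."""
    return {
        "query": {"match": {field: term}},
        "highlight": {"fields": {field: dict(highlight_options)}},
    }


if __name__ == "__main__":
    body = build_content_search("ipsum", fragment_size=150, number_of_fragments=3)
    print(json.dumps(body, indent=2))
```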


(evert) #28

Did you install the ingest attachment plugin?


(Celia Tang) #30

I have a problem too. I'm trying to index a pdf, and I have installed the ingest attachment. But when I try to PUT my pdf, it returns an error that says that content-type [application/pdf] is not supported


(David Pilato) #31

Open your own question. This one is too old.

BTW, read the documentation and you will see that you can't upload a PDF directly.
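
In other words, the request body has to be JSON with the file's bytes base64-encoded in the pipeline's source field, not the raw PDF. Here is a sketch using only Python's standard library, assuming a node at `http://localhost:9200`, a pipeline named `attachment` that reads from `data`, and a hypothetical index, type and id. The request is only built here, not sent.

```python
import base64
import json
import urllib.request


def build_index_request(pdf_bytes, url):
    """Build a PUT request that carries `pdf_bytes` as base64 inside JSON."""
    body = json.dumps({"data": base64.b64encode(pdf_bytes).decode("ascii")})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        # The payload is JSON, so the content type is JSON, not application/pdf.
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    # Assumed local node and hypothetical index/type/id.
    request = build_index_request(
        b"%PDF-1.4 placeholder bytes",
        "http://localhost:9200/myindex/mytype/1?pipeline=attachment",
    )
    print(request.get_method(), request.full_url)
    # urllib.request.urlopen(request) would send it to a running node.
```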


(Celia Tang) #32

Which documentation? And how do I open my own question?