Implementing Ingest Attachment Processor Plugin

The result of the POST to _simulate comes up in my environment just as you showed, but when I check the index with GET /index/type/id, the document does not include the attachment processor's output.

So when I search for a term or word, the result does not come back with the extracted text. Try running this against your example:

GET /index/_search?q=first

This is the result I am getting:

{
  "took": 17,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1.5306447,
    "hits": [
      {
        "_index": "test",
        "_type": "type1",
        "_id": "7",
        "_score": 1.5306447,
        "_source": {
          "docs": [
            {
              "_index": "test",
              "_type": "type1",
              "_id": "7",
              "_source": {
                "data": "JVBERi0xLjUNCiW1tbW1DQoxIDAgb2JqDQo8PC9UeXBlL0NhdGFsb2cvUGFnZXMgMiAwIFI"
              }
            }
          ]
        }
      },
      {
        "_index": "test",
        "_type": "type1",
        "_id": "1",
        "_score": 1.4201221,
        "_source": {
          "_content_type": "application/pdf",
          "data": "JVBERi0xLjMNJeLjz9MNCjUyNyAwIG9iag08PCANL0xpbmVhcml6ZWQgMSANL08gNTI5IA0",
          "attachment": {}
        }
      },
      {
        "_index": "test",
        "_type": "type1",
        "_id": "2",
        "_score": 1.4201221,
        "_source": {
          "data": "JVBERi0xLjMNJeLjz9MNCjUyNyAwIG9iag08PCANL0xpbmVhcml6ZWQgMSANL08gNTI5IA0",
          "attachment": {},
          "properties": [
            {
              "content_type": "application/pdf",
              "language": "pt-BR"
            }
          ]
        }
      }
    ]
  }
}

(I cut off part of the data because it's a PDF file and the encoded text is too big.)

As you can see, it returns only the encoded data, not the PDF content.

Please let us know whether the search above works on your side.

Thanks @dadoonet

I got it working! The problem was that when you create a processor, its field setting must name an existing field in your index mapping.

So, if you have a data field in your mapping, you can use that field as your processor field.

If you allow me, I will update the documentation with this information to make it easier for others.

Thanks and best regards!

Hey @dadoonet,

Since my issue is similar to what is being discussed here, I will post here. I can follow up to this point in the thread but I want to be able to search the attachment.content bit that has the plain text rendering of the base64 data. I also want to be able to highlight results with the content field as well.

The error I am getting when I try to index a document with a pipelined attachment:

{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse [attachment]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "failed to parse [attachment]",
    "caused_by": {
      "type": "illegal_state_exception",
      "reason": "Can't get text on a START_OBJECT at 1:15"
    }
  },
  "status": 400
}

My mapping:

{
	"mappings": {
		"document": {
			"properties": {
				"contents": {
					"type": "text"
				},
				"title": {
					"type": "text"
				},
				"location": {
					"type": "text"
				},
				"attachment": {
					"type": "text"
				}
			}
		}
	}
}

I am passing the base64 encoded binary via the contents field. My pipeline:

{
  "description": "Process documents",
  "processors": [
    {
      "attachment": {
        "field": "contents",
        "indexed_chars": -1
      }
    }
  ]
}

And finally, the document I am passing via the PUT API:

{
	"contents": "dGVzdGluZyBteSBmaXJzdCBlbmNvZGVkIHRleHQ=", 
	"title": "testfile.docx", 
	"location": "righthere"
}

Let me know if any other data is needed. Thank you.

You set up attachment as type text, but the ingest-attachment processor outputs attachment as an object, which is why parsing fails. You'll need to delete that mapping. The decoded message text will then appear in attachment.content.
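To make the mapping change concrete, here is a minimal Python sketch (standard library only) that takes the mapping from the question and rewrites attachment as an object with a text content sub-field. This is one option and an assumption on my part; simply dropping the attachment entry and letting dynamic mapping handle it should work too. The helper name fix_attachment_mapping is hypothetical, and the sketch only builds the JSON body, it does not talk to a cluster.

```python
import json

# Mapping from the question: "attachment" is declared as text, which clashes
# with the object that the attachment processor writes into that field.
original_mapping = {
    "mappings": {
        "document": {
            "properties": {
                "contents": {"type": "text"},
                "title": {"type": "text"},
                "location": {"type": "text"},
                "attachment": {"type": "text"},
            }
        }
    }
}


def fix_attachment_mapping(mapping: dict, doc_type: str = "document") -> dict:
    """Hypothetical helper: return a copy where attachment is an object field."""
    fixed = json.loads(json.dumps(mapping))  # cheap deep copy of plain JSON data
    props = fixed["mappings"][doc_type]["properties"]
    # Declare attachment as an object whose extracted text is a searchable field.
    props["attachment"] = {"properties": {"content": {"type": "text"}}}
    return fixed


fixed_mapping = fix_attachment_mapping(original_mapping)
print(json.dumps(fixed_mapping, indent=2))
```

The printed JSON can be used as the body when recreating the index.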

For what it's worth, you can use the _simulate feature of the ingest node to test things out in a really nice way. I'm going to combine the two examples into the following, which you can paste into the Dev Tools in Kibana:

POST _ingest/pipeline/_simulate
{
  "pipeline":
  {
    "description" : "shanes pipeline",
    "processors" : [
        {
          "attachment": {
            "field": "message"
          }
        },
        {
          "set": {
            "field": "attachment.title",
            "value": "{{ title }}"
          }
        },
        {
          "remove": { "field": "message" }
        },
        {
          "remove": { "field": "title" }
        }
    ]
  },
  "docs": [
    {
      "_index": "shanetest",
      "_type": "shanetest",
      "_id": "1",
      "_source": {
        "message": "dGVzdGluZyBteSBmaXJzdCBlbmNvZGVkIHRleHQ=",
        "title": "testfile.docx",
        "location": "righthere"
      }
    }
  ]
}

Which produces:

{
  "docs": [
    {
      "doc": {
        "_index": "shanetest",
        "_id": "1",
        "_type": "shanetest",
        "_source": {
          "attachment": {
            "content_type": "text/plain; charset=ISO-8859-1",
            "language": "et",
            "title": "testfile.docx",
            "content": "testing my first encoded text",
            "content_length": 30
          },
          "location": "righthere"
        },
        "_ingest": {
          "timestamp": "2016-11-03T18:42:40.881+0000"
        }
      }
    }
  ]
}

Then I can see that attachment is an object and that I've successfully moved title to attachment.title and then removed the original message field so I'm not storing the base64 content any more.


I am trying a percolator query in 5.0 on attachments and I am getting this error: "Attachment fields are not searchable: [message]"
Any information on using the percolator with attachments would be helpful.

@shradhatx I am with you. I am currently exploring Kibana as mentioned above by @shanec. The issue I am having now is querying the content field in the attachment. I also want to return highlights from that field. But since _source is not searchable, I don't know how to properly set up my query.

So, my issue has been solved. Thanks to @shanec for the help and mentioning Kibana.

For the sake of brevity, my working solution is below. I'm sure this can be optimized, but for now it gets the job done.

DELETE /myindex
PUT /myindex
{
	"mappings": {
		"document": {
			"properties": {
				"thedata": {
					"type": "text"
				},
				"title": {
					"type": "text"
				},
				"location": {
					"type": "text"
				}
			}
		}
	}
}

DELETE _ingest/pipeline/attachment
PUT _ingest/pipeline/attachment
{
  "description": "Process documents",
  "processors": [
    {
      "attachment": {
        "field": "thedata",
        "indexed_chars": -1
      }
    },
    {
      "set": {
        "field": "attachment.title",
        "value": "{{ title }}"
      }
    },
    {
      "set": {
        "field": "attachment.location",
        "value": "{{ location }}"
        }
    },
    {
      "remove": { "field": "thedata" }
    },
    {
      "remove": { "field": "title" }
    },
    {
      "remove": { "field": "location" }
    }
  ]
}

PUT /_bulk?pipeline=attachment
{"index": {"_index": "myindex", "_type" : "document", "_id" : "2" }}
{"thedata": "dGVzdGluZyBteSBmaXJzdCBlbmNvZGVkIHRleHQ=", "title": "testfile.docx", "location": "righthere"}


GET /myindex/document/2

GET /myindex/_search
{
	"query": {
	  "match": {
		  "attachment.content": "testing"
    }
  },
	"highlight": {
		"fields": {
			"attachment.content": {
				"fragment_size": 150,
				"number_of_fragments": 3,
				"no_match_size": 150
			}
		}
	}
}

Or alternatively you could do this:
DELETE /myindex
PUT /myindex
PUT /myindex/mytype/_mapping
{
  "mytype": {
    "properties": {
      "attachment": {
        "properties": {
          "content": {
            "type": "text",
            "term_vector": "with_positions_offsets",
            "store": true
          }
        }
      }
    }
  }
}

PUT myindex/mytype/1?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}

GET /myindex/mytype/_search
{
  "stored_fields": [],
  "query": {
    "match": { "attachment.content": "ipsum" }
  },
  "highlight": {
    "fields": {
      "attachment.content": {}
    }
  }
}

Did you install the ingest attachment plugin?

I have a problem too. I'm trying to index a PDF, and I have installed the ingest attachment plugin. But when I try to PUT my PDF, it returns an error saying that content-type [application/pdf] is not supported.

Open your own question. This one is too old.

BTW, read the documentation and you will see that you can't upload a PDF directly.
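As the earlier posts in this thread show, the file content is sent as a base64 string inside a JSON field, not as the raw PDF. Below is a minimal Python sketch of that encoding step, using only the standard library. The function names and the default field name "data" are my own assumptions for illustration; the field must match whatever your pipeline's attachment processor is configured to read.

```python
import base64
import json
from pathlib import Path


def build_attachment_doc(raw: bytes, field: str = "data") -> str:
    """Hypothetical helper: wrap raw file bytes as base64 in a JSON document body."""
    encoded = base64.b64encode(raw).decode("ascii")
    return json.dumps({field: encoded})


def build_attachment_doc_from_file(path: str, field: str = "data") -> str:
    """Hypothetical helper: same as above, reading the bytes from a file on disk."""
    return build_attachment_doc(Path(path).read_bytes(), field)


# Same sample text as used earlier in this thread.
body = build_attachment_doc(b"testing my first encoded text")
print(body)
```

The resulting string is what goes in the request body of the index call that names your pipeline, for example PUT myindex/mytype/1?pipeline=attachment.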

Which documentation? And how do I open my own question?

And how do I open my own question?

Click on "new topic"

Which documentation?

https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html

@dadoonet I have a question regarding the attachment data (the base64 encoded binary saved in the field). Is it possible to exclude that data from Elasticsearch? From my perspective it doesn't make sense to keep binary data in ES, because ES is never the source of truth for the data and the files are kept separately somewhere. Is it possible to exclude it and save only the extracted content?

Thanks!

Open your own question in #elasticsearch. This one is too old.
But you can link to this one if needed.

I believe you are asking about the mapper-attachments plugin, right?

First, be aware that this plugin has been deprecated in 5.0.0 and is now removed in 6.0.
If you are starting a project, don't use it!

Use ingest-attachment instead: https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html

In the pipeline, just add a remove processor to remove the BASE64 content.
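A minimal Python sketch of such a pipeline definition, in case it helps: it builds the JSON body with an attachment processor followed by a remove processor on the same field. The function name and the default field name "data" are assumptions for illustration only; adjust the field to your own documents.

```python
import json


def attachment_pipeline(field: str = "data") -> dict:
    """Hypothetical helper: pipeline body that extracts text, then drops the base64 field."""
    return {
        "description": "Extract attachment text, then remove the base64 source",
        "processors": [
            # Parse the base64 content and write the result under "attachment".
            {"attachment": {"field": field}},
            # Remove the original base64 field so it is not stored in the index.
            {"remove": {"field": field}},
        ],
    }


pipeline_body = attachment_pipeline()
print(json.dumps(pipeline_body, indent=2))
```

The printed JSON would be the body of a PUT _ingest/pipeline/attachment request. The order matters: the remove processor has to come after the attachment processor, otherwise there is nothing left to parse.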

In the context of mapper-attachments, you may be able to use the source exclude feature: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/mapping-source-field.html#include-exclude

Thanks. I was asking about ingest-attachment. That is everything I needed to know.

Hi @dadoonet. I've created a separate question about a different case: support for some kind of dynamic fields in the ingest processor.

This is fine. I have gotten this far.
How can I choose a PDF file for indexing?
I mean, how do I specify the path of the PDF file to index?

@Sudhanshu_Sekhar_Gou Please open your own question.
This one is too old and should have been closed.

You can link to it in your own post though.

This topic was automatically closed after 3 hours. New replies are no longer allowed.