Implementing Ingest Attachment Processor Plugin

@shradhatx I am with you. I am currently exploring Kibana, as mentioned above by @shanec. The issue I am having now is querying the content field in the attachment. I also want to return highlights from that field. But since _source is not searchable, I don't know how to properly set up my query.

So, my issue has been solved. Thanks to @shanec for the help and mentioning Kibana.

For the sake of brevity, my working solution is below. I'm sure this can be optimized, but for now it gets the job done.

DELETE /myindex
PUT /myindex
{
	"mappings": {
		"document": {
			"properties": {
				"thedata": {
					"type": "text"
				},
				"title": {
					"type": "text"
				},
				"location": {
					"type": "text"
				}
			}
		}
	}
}

DELETE _ingest/pipeline/attachment
PUT _ingest/pipeline/attachment
{
  "description": "Process documents",
  "processors": [
    {
      "attachment": {
        "field": "thedata",
        "indexed_chars": -1
      }
    },
    {
      "set": {
        "field": "attachment.title",
        "value": "{{ title }}"
      }
    },
    {
      "set": {
        "field": "attachment.location",
        "value": "{{ location }}"
      }
    },
    {
      "remove": { "field": "thedata" }
    },
    {
      "remove": { "field": "title" }
    },
    {
      "remove": { "field": "location" }
    }
  ]
}

PUT /_bulk?pipeline=attachment
{"index": {"_index": "myindex", "_type" : "document", "_id" : "2" }}
{"thedata": "dGVzdGluZyBteSBmaXJzdCBlbmNvZGVkIHRleHQ=", "title": "testfile.docx", "location": "righthere"}


GET /myindex/document/2

GET /myindex/_search
{
  "query": {
    "match": {
      "attachment.content": "testing"
    }
  },
  "highlight": {
    "fields": {
      "attachment.content": {
        "fragment_size": 150,
        "number_of_fragments": 3,
        "no_match_size": 150
      }
    }
  }
}
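
For anyone wondering why the match on "testing" finds this document: the thedata value in the bulk request is just base64-encoded text. Here is a minimal Python sketch (standard library only) that reproduces that value. The helper name to_attachment_field is made up for illustration and is not part of any Elasticsearch client.

```python
import base64


def to_attachment_field(raw: bytes) -> str:
    """Base64-encode raw bytes into the string form used for `thedata` above.

    Hypothetical helper for illustration only.
    """
    return base64.b64encode(raw).decode("ascii")


# Encode the sample text, then decode it again to confirm the round trip.
encoded = to_attachment_field(b"testing my first encoded text")
print(encoded)
print(base64.b64decode(encoded).decode("utf-8"))
```

The same helper should work for the bytes of a real file (for example a .docx or .pdf), since the attachment processor only sees the base64 string.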

Or, alternatively, you could do this:
DELETE /myindex
PUT /myindex
PUT /myindex/mytype/_mapping
{
  "mytype": {
    "properties": {
      "attachment": {
        "properties": {
          "content": {
            "type": "text",
            "term_vector": "with_positions_offsets",
            "store": true
          }
        }
      }
    }
  }
}

PUT myindex/mytype/1?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}

GET /myindex/mytype/_search
{
  "stored_fields": [],
  "query": {
    "match": { "attachment.content": "ipsum" }
  },
  "highlight": {
    "fields": {
      "attachment.content": {}
    }
  }
}

Did you install the ingest attachment plugin?

I have a problem too. I'm trying to index a PDF, and I have installed the ingest attachment plugin. But when I try to PUT my PDF, it returns an error saying that content-type [application/pdf] is not supported.

Open your own question. This one is too old.

BTW, read the documentation and you will see that you can't upload a PDF directly.
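
The attachment processor expects the file as a base64-encoded string inside a JSON document, not as raw PDF bytes. Below is a rough Python sketch of preparing such a body from a file on disk. The function name build_attachment_body is hypothetical, and the default field name "data" is an assumption: it has to match whatever "field" your own pipeline's attachment processor reads from.

```python
import base64
import json
from pathlib import Path


def build_attachment_body(path: str, field: str = "data") -> str:
    """Read a file and wrap its base64-encoded bytes in a JSON document.

    `field` is an assumption; use the name your attachment processor reads.
    """
    raw = Path(path).read_bytes()                       # raw PDF/DOCX/... bytes
    encoded = base64.b64encode(raw).decode("ascii")     # JSON-safe string
    return json.dumps({field: encoded})
```

The returned string can then be sent as the JSON body of an index request that names the pipeline, such as PUT myindex/mytype/1?pipeline=attachment, with a Content-Type of application/json.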

Which documentation? And how do I open my own question?

And how do I open my own question?

https://discuss.elastic.co/c/elasticsearch

Click on "new topic"

Which documentation?

@dadoonet I have a question regarding attachment data (the base64-encoded binary saved in a field). Is it possible to exclude that data from Elasticsearch? From my perspective, it doesn't make sense to keep binary data in ES, because ES is never the source of truth for the data and the files are kept separately somewhere else. Is it possible to exclude it and save only the extracted content?

Thanks!

Open your own question in #elasticsearch. This one is too old.
But you can link to this one if needed.

I believe you are asking about the mapper-attachments plugin, right?

First, be aware that this plugin has been deprecated in 5.0.0 and is now removed in 6.0.
If you are starting a project, don't use it!

Use ingest-attachment instead: https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html

In the pipeline, just add a remove processor to remove the BASE64 content.
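
As a sketch of what that could look like, here is a small Python snippet that builds a pipeline definition with an attachment processor followed by a remove processor. The function name attachment_pipeline and the default field name "data" are assumptions for illustration; use the field your documents actually carry.

```python
import json


def attachment_pipeline(source_field: str = "data") -> dict:
    """Build a pipeline body that extracts text, then drops the base64 source.

    Hypothetical helper; `source_field` is an assumed field name.
    """
    return {
        "description": "Extract attachment text, then drop the base64 source",
        "processors": [
            # Extract content and metadata into the `attachment` field.
            {"attachment": {"field": source_field, "indexed_chars": -1}},
            # Remove the original base64 string so it is not stored.
            {"remove": {"field": source_field}},
        ],
    }


print(json.dumps(attachment_pipeline(), indent=2))
```

The printed JSON can be used as the body of PUT _ingest/pipeline/attachment. The order matters here: the remove processor has to come after the attachment processor, otherwise there is nothing left to extract from.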

In the context of mapper-attachments, you can maybe use the source exclude feature: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/mapping-source-field.html#include-exclude

Thanks. I was asking about ingest-attachment. That is everything I needed to know.

Hi @dadoonet. I've created a separate question about a different case: support for some kind of dynamic fields in the ingest processor.

This is fine. I have gotten this far.
How can I choose a PDF file for indexing?
I mean, how do I specify the PDF file path for indexing?

@Sudhanshu_Sekhar_Gou Please open your own question.
This one is too old and should have been closed.

You can link to it in your own post though.
