Can I conditionally trigger a processor based on the current nested iteration or execute a plugin in a script?

Is there a way to process an attachment conditionally for a nested object that's represented by an array?

I looked into the foreach processor with a conditional if, but the ctx object does not appear to hold a reference to the current array position, so I cannot find a way to write the if in a way that's valid.

I've also been playing around with the script processor to see if I could loop through the property and conditionally process each item, but I can't find a way to access the plugin logic from the Painless script.

Here's a simplified version of the index I'm using:

{
	"mappings": {
		"properties": {
			"title": { "type": "text" },
			"keywords": { "type": "keyword" },
			"article": {
				"type": "nested",
				"properties": {
					"id": {
						"type": "text"
					},
					"content": {
						"type": "text"
					},
					"type": {
						"type": "text"
					}
				}
			}
		}
	},
	"settings": {
		"number_of_shards": 1,
		"number_of_replicas": 2
	}
}

An article document might look like this:

{
	"title": "My article"
	, "keywords": []
	, "article": [
		  {"type": "text", "id": "basic", "content": "Some text"}
		, {"type": "attachment", "id": "file-id-1", "content": "base64 encoded data..."}
		, {"type": "attachment", "id": "file-id-2", "content": "base64 encoded data..."}
	]
}

My goal is to process the article nested object, but only process the array items that have a type equal to attachment.

My first attempt was to add a conditional statement to my pipeline:

{
	"description" : "Extract attachment information from arrays",
	"processors" : [
		{
			"foreach": {
				"field": "article",
				"processor": {
					"attachment": {
						  "if": "ctx.article.type == \"attachment\""
						, "target_field": "_ingest._value.content"
						, "field": "_ingest._value.content"
					}
				}
			}
		}
	]
}

However, the ctx variable holds the root document object and I cannot figure out a way to access the current index in the article collection.

Next, I tried implementing a script, but I cannot figure out how to programmatically trigger the attachment processor. The following is the basic code I was working from. It does successfully remove the content key for attachments, but what I want to do is replace the base64 data with the extracted text.

{
	"description" : "Extract attachment information from arrays"
	, "processors" : [
		{
			"foreach": {
				"field": "article",
				"processor": {
					"script": {
						  "lang": "painless"
						, "source": #serializeJSON('
							ctx.article.stream()
								.filter(x -> x.type == "attachment")
								.forEach(x -> x.remove("content"))
							;
						')#
					}
				}
			}
		}
	]
}

Is there a way to do what I want?

In case no one has a solution to this problem, what I'm doing currently as an intermediate step is to send the attachments to a simulated pipeline to extract the text, and then use those results when updating the document. This of course requires multiple steps, but I can send the attachments in bulk.

So here's an example of what I'm sending:

POST /_ingest/pipeline/_simulate
{
	"pipeline" :
	{
		"description" : "Extract attachment information from arrays",
		"processors" : [
			{
				"foreach": {
					"field": "attachments",
					"processor": {
						"attachment": {
							"target_field": "_ingest._value.attachment",
							"field": "_ingest._value.data"
						}
					}
				}
			}
			, {
				"foreach": {
					"field": "attachments",
					"processor": {
						"remove": {
							"field": "_ingest._value.data"
						}
					}
				}
			}
		]
	},
	"docs": [{
		"_index": "tmp_attachment_pipeline_simulation",
		"_id": "a92cb4de963529804557f465570fab9d",
		"_source": {
			"attachments": [
				  { "filepath":"/some/path/to/file1", "data":"c29tZSBleGFtcGxlIHRleHQ=" }
				, { "filepath":"/some/path/to/file2", "data":"c29tZSBleGFtcGxlIHRleHQ=" }
			]
		}
	}]
}		

You would just replace the attachments array with the input you want to send, and then process the results to extract the content and use it in your document model.
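As a rough sketch, the glue code around that request could look something like this in Python. The helper names are hypothetical, and it assumes the simulate response nests each result under docs[].doc._source and that the attachment processor writes the extracted text to a content key inside the target field. It only builds and reads plain dictionaries; sending the request is left out.

```python
import base64


def build_simulate_docs(index, doc_id, attachments):
    """Wrap raw attachments in the `docs` array of a simulate request.

    `attachments` is a list of (filepath, raw_bytes) pairs. This helper and
    its argument names are hypothetical.
    """
    return [{
        "_index": index,
        "_id": doc_id,
        "_source": {
            "attachments": [
                {"filepath": path, "data": base64.b64encode(raw).decode("ascii")}
                for path, raw in attachments
            ]
        },
    }]


def extracted_text_by_path(simulate_response):
    """Map each filepath to the text found under `attachment.content`.

    Assumes results are nested under docs[].doc._source; entries without a
    `doc` key (for example, failed documents) are skipped.
    """
    texts = {}
    for result in simulate_response.get("docs", []):
        source = result.get("doc", {}).get("_source", {})
        for item in source.get("attachments", []):
            texts[item["filepath"]] = item.get("attachment", {}).get("content")
    return texts
```

The dictionary returned by extracted_text_by_path can then be used to fill in the content of the matching items before indexing the real document.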

I'm still looking for a way to do this in a single step, but for the time being this works.

Anyone have a better solution?

Here's a pipeline that does what I want, but I don't like having to fall back to failing when the content key isn't base64. I'd rather just conditionally process it instead (plus it's theoretically possible the content could happen to be a valid base64 string, which would confuse things).

Is there any way to get the if property on the foreach or attachment processor to see the current iteration value, so that the processor is simply skipped when the current article type is not attachment?

{
	"pipeline" :
	{
		"description" : "Extract attachment information from arrays",
		"processors" : [
			{
				"foreach": {
					"field": "article",
					"processor": {
						"attachment": {
							  "field": "_ingest._value.content"
							, "target_field": "_ingest._value.attachment"
							, "ignore_failure": true
						}
					}
				}
			}
			, {
				"script": {
						"lang": "painless"
					, "source": "ctx.article.stream().filter(x -> x.type == \"attachment\").forEach(x -> { x.content = x.attachment?.content; x.remove(\"attachment\"); });"
				}
			}
		]
	}
	, "docs": [{
		"_index": "tmp_test",
		"_id": "a92cb4de963529804557f465570fab9d",
		"_source": {
				"title": "My Title"
			, "keywords": []
			, "article": [
				{
					"type" : "text",
					"title" : "Basic",
					"id" : "basic",
					"content" : "This is some text."
				},
				{
					"type" : "text",
					"title" : "Intermediate",
					"id" : "intermediate",
					"content" : "Some secondary content"
				},
				{
					"type" : "attachment",
					"title" : "the_file_name.pdf",
					"id" : "file-id-1",
					"content" : "base64 encoded data..."
				},
				{
					"type" : "attachment",
					"title" : "the_file_name.docx",
					"id" : "file-id-2",
					"content" : "base64 encoded data..."
				}
			]
		}
	}]
}
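To illustrate why relying on the base64 failure is fragile, here is a small Python check. It uses Python's strict decoder as a stand-in, which is an assumption: the attachment processor's own decoding may accept or reject different inputs. The function name is hypothetical.

```python
import base64
import binascii


def is_strict_base64(value):
    """Return True if `value` decodes under Python's strict base64 rules."""
    try:
        base64.b64decode(value, validate=True)
        return True
    except (binascii.Error, ValueError):
        return False
```

Under this check, "This is some text." is rejected because of the spaces and the period, but a short plain-text value such as "Some" passes, so a text item could be treated as an attachment by mistake.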

Is there a better way to do this?

