Can I conditionally trigger a processor based on the current nested iteration or execute a plugin in script?

dswitzer · January 26, 2021, 10:19pm

Is there a way to process an attachment conditionally for a nested object that's represented by an array?

I looked into the foreach processor with a conditional if, but the ctx object does not appear to hold a reference to the current array position, so I cannot find a way to write the if in a way that's valid.

I've also been playing around with the script processor to see if I could find a way to loop through the property and conditionally process each item, but I can't find a way to access the plugin logic from the Painless script.

Here's a simplified version of the index I'm using:

{
	"mappings": {
		"properties": {
			"title": { "type": "text" },
			"keywords": { "type": "keyword" },
			"article": {
				"type": "nested",
				"properties": {
					"id": {
						"type": "text"
					},
					"content": {
						"type": "text"
					},
					"type": {
						"type": "text"
					}
				}
			}
		}
	},
	"settings": {
		"number_of_shards": 1,
		"number_of_replicas": 2
	}
}

An article document might look like this:

{
	"title": "My article"
	, "keywords": []
	, "article": [
		  {"type": "text", "id": "basic", "content": "Some text"}
		, {"type": "attachment", "id": "file-id-1", "content": "base64 encoded data..."}
		, {"type": "attachment", "id": "file-id-2", "content": "base64 encoded data..."}
	]
}

My goal is to process the article nested object, but only process the array items that have a type equal to attachment.

My first attempt was to add a conditional statement to my pipeline:

{
	"description" : "Extract attachment information from arrays",
	"processors" : [
		{
			"foreach": {
				"field": "article",
				"processor": {
					"attachment": {
						  "if": "ctx.article.type == \"attachment\""
						, "target_field": "_ingest._value.content"
						, "field": "_ingest._value.content"
					}
				}
			}
		}
	]
}

However, the ctx variable holds the root document object and I cannot figure out a way to access the current index in the article collection.

Next, I tried implementing a script, but I cannot figure out how to programmatically trigger the attachment processor. The following is the basic code I was working from. It does successfully remove the content key for attachments, but what I want to do is replace the base64 data with the extracted text.

{
	"description" : "Extract attachment information from arrays"
	, "processors" : [
		{
			"foreach": {
				"field": "article",
				"processor": {
					"script": {
						  "lang": "painless"
						, "source": #serializeJSON('
							ctx.article.stream()
								.filter(x -> x.type == "attachment")
								.forEach(x -> x.remove("content"))
							;
						')#
					}
				}
			}
		}
	]
}

Is there a way to do what I want?

dswitzer · January 27, 2021, 3:22pm

In case no one has a solution to this problem, what I'm doing currently as an interim step is sending the attachments to a simulated pipeline to extract the text and then using those results when updating the document. This of course requires multiple steps, but I can send the attachments in bulk.

So here's an example of what I'm sending:

POST /_ingest/pipeline/_simulate

{
	"pipeline" :
	{
		"description" : "Extract attachment information from arrays",
		"processors" : [
			{
				"foreach": {
					"field": "attachments",
					"processor": {
						"attachment": {
							"target_field": "_ingest._value.attachment",
							"field": "_ingest._value.data"
						}
					}
				}
			}
			, {
				"foreach": {
					"field": "attachments",
					"processor": {
						"remove": {
							"field": "_ingest._value.data"
						}
					}
				}
			}
		]
	},
	"docs": [{
		"_index": "tmp_attachment_pipeline_simulation",
		"_id": "a92cb4de963529804557f465570fab9d",
		"_source": {
			"attachments": [
				  { "filepath":"/some/path/to/file1", "data":"c29tZSBleGFtcGxlIHRleHQ=" }
				, { "filepath":"/some/path/to/file2", "data":"c29tZSBleGFtcGxlIHRleHQ=" }
			]
		}
	}]
}

You would just replace the attachments array with the input you want to send and then process the results to extract the content and use it in your document model.
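For anyone wanting to script this two-step workaround, here is a minimal client-side sketch in Python. It only builds the _simulate request body from the article items and merges the extracted text back in. Sending the request is left to whatever HTTP client you use. The function names, the "id" key carried through the attachments array (my request above carries "filepath" instead), and the "simulated-doc" _id are assumptions for illustration. It also assumes the simulation succeeded, so each entry in the response has a "doc" key rather than an "error" key.

```python
import copy


def build_simulate_body(article_items):
    """Build a _simulate request body holding only the attachment items.

    Assumption: each item carries its "id" along so results can be matched up later.
    """
    attachments = [
        {"id": item["id"], "data": item["content"]}
        for item in article_items
        if item.get("type") == "attachment"
    ]
    return {
        "pipeline": {
            "description": "Extract attachment information from arrays",
            "processors": [
                {"foreach": {"field": "attachments", "processor": {"attachment": {
                    "target_field": "_ingest._value.attachment",
                    "field": "_ingest._value.data",
                }}}},
                {"foreach": {"field": "attachments", "processor": {"remove": {
                    "field": "_ingest._value.data",
                }}}},
            ],
        },
        "docs": [{
            "_index": "tmp_attachment_pipeline_simulation",
            "_id": "simulated-doc",  # hypothetical placeholder id
            "_source": {"attachments": attachments},
        }],
    }


def merge_extracted_text(article_items, simulate_response):
    """Return a copy of article_items with attachment content replaced by extracted text."""
    simulated = simulate_response["docs"][0]["doc"]["_source"]["attachments"]
    text_by_id = {a["id"]: a.get("attachment", {}).get("content") for a in simulated}
    merged = copy.deepcopy(article_items)
    for item in merged:
        if item.get("type") == "attachment" and item["id"] in text_by_id:
            item["content"] = text_by_id[item["id"]]
    return merged
```

The merge step leaves text items alone and works on a copy, so the original base64 payloads are still available if the simulation needs to be retried.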

I'm still looking for a way to do this in a single step, but for the time being this works.

Anyone have a better solution?

dswitzer · January 27, 2021, 7:01pm

Here's a pipeline that does what I want, but I don't like having to fall back to failing when the content key isn't base64. I'd rather just conditionally process it instead (plus it's theoretically possible the content could happen to be a valid base64 string, which would confuse things).

Is there any way to get the if property on the foreach or attachment processor to see the current iteration value, so that the processor is simply skipped when the current article type is not attachment?

{
	"pipeline" :
	{
		"description" : "Extract attachment information from arrays",
		"processors" : [
			{
				"foreach": {
					"field": "article",
					"processor": {
						"attachment": {
							  "field": "_ingest._value.content"
							, "target_field": "_ingest._value.attachment"
							, "ignore_failure": true
						}
					}
				}
			}
			, {
				"script": {
						"lang": "painless"
					, "source": "ctx.article.stream().filter(x -> x.type == \"attachment\").forEach(x -> { x.content = x.attachment?.content; x.remove(\"attachment\"); });"
				}
			}
		]
	}
	, "docs": [{
		"_index": "tmp_test",
		"_id": "a92cb4de963529804557f465570fab9d",
		"_source": {
				"title": "My Title"
			, "keywords": []
			, "article": [
				{
					"type" : "text",
					"title" : "Basic",
					"id" : "basic",
					"content" : "This is some text."
				},
				{
					"type" : "text",
					"title" : "Intermediate",
					"id" : "intermediate",
					"content" : "Some secondary content"
				},
				{
					"type" : "attachment",
					"title" : "the_file_name.pdf",
					"id" : "file-id-1",
					"content" : "base64 encoded data..."
				},
				{
					"type" : "attachment",
					"title" : "the_file_name.docx",
					"id" : "file-id-2",
					"content" : "base64 encoded data..."
				}
			]
		}
	}]
}
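To make the effect of the script step easier to follow, here is a rough Python equivalent of what the Painless one-liner does to each item. This is plain Python for illustration only, not something Elasticsearch runs, and the function name is made up.

```python
def promote_extracted_text(article_items):
    """Mirror of the Painless cleanup step, applied in place to a list of dicts."""
    for item in article_items:
        # Only items typed as attachments are touched, like the .filter(...) call.
        if item.get("type") != "attachment":
            continue
        attachment = item.get("attachment")
        # `x.attachment?.content` is null-safe: a missing attachment yields null.
        item["content"] = attachment.get("content") if attachment is not None else None
        # x.remove("attachment") is harmless when the key is absent.
        item.pop("attachment", None)
    return article_items
```

Two consequences of the script as written show up here: an attachment item whose extraction failed ends up with a null content, and a text item is never cleaned up, so if its content did happen to parse as base64 it would keep the extra attachment key.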

Is there a better way to do this?

dswitzer · February 16, 2021, 4:17pm

Is there a better way to do this?

system · March 16, 2021, 4:18pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
