Can the Ingest Attachment Processor Plugin extract array data?


(M4urice) #1

Hello! :slight_smile:

So I am diving into v5 at the moment and want to use the Ingest Attachment Processor Plugin (https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html) in combination with the array "datatype" (https://www.elastic.co/guide/en/elasticsearch/reference/current/array.html).

A document would look somewhat like this:

{
    "_index": "visual-draft-3",
    "_type": "document",
    "_id": "441FDE6CFFF3D11EC12570F10053DE49",
    "_score": 1,
    "_source": {
        "EingangMuster": null,
        "Produkt_13": null,
        "Produkt_11": null,
        "attachments": [
            {
                "filename": "somedoc.pdf",
                "data": "base64DataString"
            },
            {
                "filename": "somedoc.docx",
                "data": "base64DataString"
            }
        ]
    }
}

Ingest Pipeline:

esClient.ingest.putPipeline({
    id: "attachment_pipe",
    body: {
        "description": "Process document attachments",
        "processors": [{
            "attachment": {
                "field": "data",
                "indexed_chars": -1
            }
        }]
    }
}, function(error, response) {
    console.log(error, response);
});

This does not work out of the box for me, since the document contains an array of attachments, and each array element (not the document root) holds the data field the pipeline expects. The error output:

org.elasticsearch.ElasticsearchException: java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [data]]; nested: IllegalArgumentException[field [data] not present as part of path [data]];
	at org.elasticsearch.ingest.CompoundProcessor.newCompoundProcessorException(CompoundProcessor.java:156) ~[elasticsearch-5.0.1.jar:5.0.1]
	at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:107) ~[elasticsearch-5.0.1.jar:5.0.1]
	at org.elasticsearch.ingest.Pipeline.execute(Pipeline.java:58) ~[elasticsearch-5.0.1.jar:5.0.1]
	at org.elasticsearch.ingest.PipelineExecutionService.innerExecute(PipelineExecutionService.java:166) ~[elasticsearch-5.0.1.jar:5.0.1]
	at org.elasticsearch.ingest.PipelineExecutionService.access$000(PipelineExecutionService.java:41) ~[elasticsearch-5.0.1.jar:5.0.1]
	at org.elasticsearch.ingest.PipelineExecutionService$1.doRun(PipelineExecutionService.java:65) [elasticsearch-5.0.1.jar:5.0.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:520) [elasticsearch-5.0.1.jar:5.0.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.0.1.jar:5.0.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_92]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_92]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_92]
Caused by: java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [data]]; nested: IllegalArgumentException[field [data] not present as part of path [data]];
	... 11 more
Caused by: org.elasticsearch.ElasticsearchParseException: Error parsing document in field [data]
	at org.elasticsearch.ingest.attachment.AttachmentProcessor.execute(AttachmentProcessor.java:131) ~[?:?]
	at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:100) ~[elasticsearch-5.0.1.jar:5.0.1]
	... 9 more
Caused by: java.lang.IllegalArgumentException: field [data] not present as part of path [data]
	at org.elasticsearch.ingest.IngestDocument.resolve(IngestDocument.java:308) ~[elasticsearch-5.0.1.jar:5.0.1]
	at org.elasticsearch.ingest.IngestDocument.getFieldValue(IngestDocument.java:114) ~[elasticsearch-5.0.1.jar:5.0.1]
	at org.elasticsearch.ingest.IngestDocument.getFieldValueAsBytes(IngestDocument.java:141) ~[elasticsearch-5.0.1.jar:5.0.1]
	at org.elasticsearch.ingest.attachment.AttachmentProcessor.execute(AttachmentProcessor.java:71) ~[?:?]
	at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:100) ~[elasticsearch-5.0.1.jar:5.0.1]
	... 9 more

How can the ingest plugin access the nested Array data?

Possible alternatives :

  • Create a type for attachments and use the plugin there (then I would have to combine query results)
  • Add n data fields to each document and put the attachment data there (I don't want to do this: many null fields and a cluttered structure, sometimes n > 50!)

I welcome any ideas, solutions and creative input! :wink:


(M4urice) #2

push

What can I do to get an answer? Is some information missing? Is the question not clear?


(David Pilato) #3

The question is really clear. It's on my TODO (TO ANSWER) list :slight_smile:

I wonder if you can try to use a foreach processor for that?

Can you give it a try and come back here?


(Martijn Van Groningen) #4

The foreach processor should be able to help you here, like David said. If it doesn't, then I think that should be fixed.


(Gameldar) #5

It doesn't seem to work exactly as you'd like: the attachment details end up available for only one entry, because they are inserted into a single top-level field (overwritten on each iteration) rather than into the array element being accessed. But I'll freely admit I'm not 100% sure I'm using the processors correctly.

However, using the simulate API with the following document:

POST _ingest/pipeline/_simulate?verbose&pretty
{

  "pipeline" :
  {
    "description": "_description",
    "processors": [
      {
        "foreach":
        {
          "field": "attachments",
          "processor": {
            "attachment": {
              "field": "_ingest._value.data"
            }
          } 
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_type": "type",
      "_id": "id",
      "_source": {
        "attachments": [
          {
            "filename": "ipsum.txt",
            "data": "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo="
          },
          {
            "filename": "test.txt",
            "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
          }
        ]
      }
    }
  ]
}

This results in:

{
  "docs" : [
    {
      "processor_results" : [
        {
          "doc" : {
            "_id" : "id",
            "_index" : "index",
            "_type" : "type",
            "_source" : {
              "attachments" : [
                {
                  "data" : "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo=",
                  "filename" : "ipsum.txt"
                },
                {
                  "data" : "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
                  "filename" : "test.txt"
                }
              ],
              "attachment" : {
                "language" : "ro",
                "content_type" : "application/rtf",
                "content" : "Lorem ipsum dolor sit amet",
                "content_length" : 28
              }
            },
            "_ingest" : {
              "_value" : null,
              "timestamp" : "2016-12-21T01:42:50.269+0000"
            }
          }
        }
      ]
    }
  ]
}

(Apologies for the formatting; it seems I can't have two code blocks without the quoting.)


(Gameldar) #6

OK, I found out how to do it correctly (yay for open source).

The key is to set target_field on the attachment processor and point it at a field under _ingest._value:

e.g.

POST _ingest/pipeline/_simulate?verbose&pretty
{

  "pipeline" :
  {
    "description": "_description",
    "processors": [
      {
        "foreach":
        {
          "field": "attachments",
          "processor": {
            "attachment": {
              "target_field": "_ingest._value.attachment",
              "field": "_ingest._value.data"
            }
          } 
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_type": "type",
      "_id": "id",
      "_source": {
        "attachments": [
          {
            "filename": "ipsum.txt",
            "data": "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo="
          },
          {
            "filename": "test.txt",
            "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
          }
        ]
      }
    }
  ]
}

This will then correctly attach the extracted data to each array element, rather than to the top-level "attachment" field, which is the default.
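Adapting this back to the putPipeline call from the first post might look like the following. This is an untested sketch: the pipeline id "attachment_pipe" and the elasticsearch-js callback style are taken from the original post, and the pipeline body just combines that post's settings with the foreach/target_field approach shown above.

```javascript
// Pipeline body: a foreach over the "attachments" array, running the
// attachment processor on each element's "data" field and writing the
// result back onto that same element via target_field.
const pipelineBody = {
  description: "Process document attachments",
  processors: [
    {
      foreach: {
        field: "attachments",
        processor: {
          attachment: {
            field: "_ingest._value.data",
            target_field: "_ingest._value.attachment",
            indexed_chars: -1
          }
        }
      }
    }
  ]
};

// With the client from the first post (not verified against a cluster):
// esClient.ingest.putPipeline({ id: "attachment_pipe", body: pipelineBody },
//   function (error, response) { console.log(error, response); });
```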


(David Pilato) #7

Great!

Wondering if it is worth adding this as an example in the documentation?
Would you like to open a PR to add this?


(Gameldar) #8

I've done so with Issue #22294 - hopefully I've done it correctly.


(system) #9

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.