Can the Ingest Attachment Processor Plugin extract array data?


(M4urice) #1

Hello! :slight_smile:

So I am diving into v5 at the moment and want to use the Ingest Attachment Processor Plugin (https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html) in combination with the array "datatype" (https://www.elastic.co/guide/en/elasticsearch/reference/current/array.html).

A document would look somewhat like this:

{
    "_index": "visual-draft-3",
    "_type": "document",
    "_id": "441FDE6CFFF3D11EC12570F10053DE49",
    "_score": 1,
    "_source": {
        "EingangMuster": null,
        "Produkt_13": null,
        "Produkt_11": null,
        "attachments": [
            {
                "filename": "somedoc.pdf",
                "data": "base64DataString"
            },
            {
                "filename": "somedoc.docx",
                "data": "base64DataString"
            }
        ]
    }
}

Ingest Pipeline:

esClient.ingest.putPipeline({
    id: "attachment_pipe",
    body: {
        "description": "Process document attachments",
        "processors": [{
            "attachment": {
                "field": "data",
                "indexed_chars": -1
            }
        }]
    }
}, function(error, response) {
    console.log(error, response);
});

This does not work out of the box for me, since the document contains an array of attachments, and each array element (not the document root) holds the data field the pipeline expects. The error output:

org.elasticsearch.ElasticsearchException: java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [data]]; nested: IllegalArgumentException[field [data] not present as part of path [data]];
	at org.elasticsearch.ingest.CompoundProcessor.newCompoundProcessorException(CompoundProcessor.java:156) ~[elasticsearch-5.0.1.jar:5.0.1]
	at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:107) ~[elasticsearch-5.0.1.jar:5.0.1]
	at org.elasticsearch.ingest.Pipeline.execute(Pipeline.java:58) ~[elasticsearch-5.0.1.jar:5.0.1]
	at org.elasticsearch.ingest.PipelineExecutionService.innerExecute(PipelineExecutionService.java:166) ~[elasticsearch-5.0.1.jar:5.0.1]
	at org.elasticsearch.ingest.PipelineExecutionService.access$000(PipelineExecutionService.java:41) ~[elasticsearch-5.0.1.jar:5.0.1]
	at org.elasticsearch.ingest.PipelineExecutionService$1.doRun(PipelineExecutionService.java:65) [elasticsearch-5.0.1.jar:5.0.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:520) [elasticsearch-5.0.1.jar:5.0.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.0.1.jar:5.0.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_92]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_92]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_92]
Caused by: java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [data]]; nested: IllegalArgumentException[field [data] not present as part of path [data]];
	... 11 more
Caused by: org.elasticsearch.ElasticsearchParseException: Error parsing document in field [data]
	at org.elasticsearch.ingest.attachment.AttachmentProcessor.execute(AttachmentProcessor.java:131) ~[?:?]
	at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:100) ~[elasticsearch-5.0.1.jar:5.0.1]
	... 9 more
Caused by: java.lang.IllegalArgumentException: field [data] not present as part of path [data]
	at org.elasticsearch.ingest.IngestDocument.resolve(IngestDocument.java:308) ~[elasticsearch-5.0.1.jar:5.0.1]
	at org.elasticsearch.ingest.IngestDocument.getFieldValue(IngestDocument.java:114) ~[elasticsearch-5.0.1.jar:5.0.1]
	at org.elasticsearch.ingest.IngestDocument.getFieldValueAsBytes(IngestDocument.java:141) ~[elasticsearch-5.0.1.jar:5.0.1]
	at org.elasticsearch.ingest.attachment.AttachmentProcessor.execute(AttachmentProcessor.java:71) ~[?:?]
	at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:100) ~[elasticsearch-5.0.1.jar:5.0.1]
	... 9 more

How can the ingest plugin access the nested Array data?

Possible alternatives :

  • Create a type for attachments and use the plugin there (then I would have to combine query results)
  • Add n data fields to each document and put the attachment data there (I don't want to do this: many null fields and a cluttered structure, sometimes n > 50!)

I welcome any ideas, solutions and creative input! :wink:


(M4urice) #2

push

What can I do to get an answer? Is some information missing? Is the question not clear?


(David Pilato) #3

The question is really clear. It's on my TODO (TO ANSWER) list :slight_smile:

I wonder if you can try to use a foreach processor for that?

Can you give it a try and come back here?


(Martijn Van Groningen) #4

The foreach processor should be able to help you here, like David said. If it doesn't, then I think that should be fixed.


(Gameldar) #5

It doesn't seem to work exactly as you'd like: the attachment details end up available for only one entry, because they are inserted into a single top-level field (overwritten on each iteration) rather than into the array element being accessed. But I'll freely admit I'm not 100% sure I'm using the processors correctly.

However, using the simulate API with the following document:

POST _ingest/pipeline/_simulate?verbose&pretty
{

  "pipeline" :
  {
    "description": "_description",
    "processors": [
      {
        "foreach":
        {
          "field": "attachments",
          "processor": {
            "attachment": {
              "field": "_ingest._value.data"
            }
          } 
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_type": "type",
      "_id": "id",
      "_source": {
        "attachments": [
          {
            "filename": "ipsum.txt",
            "data": "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo="
          },
          {
            "filename": "test.txt",
            "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
          }
        ]
      }
    }
  ]
}

This results in:

{
  "docs" : [
    {
      "processor_results" : [
        {
          "doc" : {
            "_id" : "id",
            "_index" : "index",
            "_type" : "type",
            "_source" : {
              "attachments" : [
                {
                  "data" : "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo=",
                  "filename" : "ipsum.txt"
                },
                {
                  "data" : "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
                  "filename" : "test.txt"
                }
              ],
              "attachment" : {
                "language" : "ro",
                "content_type" : "application/rtf",
                "content" : "Lorem ipsum dolor sit amet",
                "content_length" : 28
              }
            },
            "_ingest" : {
              "_value" : null,
              "timestamp" : "2016-12-21T01:42:50.269+0000"
            }
          }
        }
      ]
    }
  ]
}

(Apologies for the formatting; it seems I can't have two code blocks without the quoting.)


(Gameldar) #6

OK, I found out how to do it correctly (yay for open source).

The key is to set target_field on the attachment processor and point it at a field under _ingest._value:

e.g.

POST _ingest/pipeline/_simulate?verbose&pretty
{

  "pipeline" :
  {
    "description": "_description",
    "processors": [
      {
        "foreach":
        {
          "field": "attachments",
          "processor": {
            "attachment": {
              "target_field": "_ingest._value.attachment",
              "field": "_ingest._value.data"
            }
          } 
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_type": "type",
      "_id": "id",
      "_source": {
        "attachments": [
          {
            "filename": "ipsum.txt",
            "data": "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo="
          },
          {
            "filename": "test.txt",
            "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
          }
        ]
      }
    }
  ]
}

This will then correctly attach the extracted data to each array element, rather than to the top-level "attachment" field, which is the default.
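Adapting this back to the putPipeline call from the first post might look like the following. This is an untested sketch: the pipeline id "attachment_pipe" and the elasticsearch-js callback style are taken from the original post, and the pipeline body just combines that post's settings with the foreach/target_field approach shown above.

```javascript
// Pipeline body: a foreach over the "attachments" array, running the
// attachment processor on each element's "data" field and writing the
// result back onto that same element via target_field.
const pipelineBody = {
  description: "Process document attachments",
  processors: [
    {
      foreach: {
        field: "attachments",
        processor: {
          attachment: {
            field: "_ingest._value.data",
            target_field: "_ingest._value.attachment",
            indexed_chars: -1
          }
        }
      }
    }
  ]
};

// With the client from the first post (not verified against a cluster):
// esClient.ingest.putPipeline({ id: "attachment_pipe", body: pipelineBody },
//   function (error, response) { console.log(error, response); });
```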


(David Pilato) #7

Great!

Wondering if it is worth adding this as an example in the documentation?
Would you like to open a PR to add this?


(Gameldar) #8

I've done so with Issue #22294 - hopefully I've done it correctly.


(system) #9

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.