Ingest Attachment Plugin: How to add data to an existing array?

Hi :slightly_smiling_face:

I'm new to Elasticsearch, so maybe this is simple to solve, but I haven't found a solution to my problem:
I have an object with an array containing attachment data, and I want to be able to append new data to this array. I know how to use the ingest attachment plugin to extract the data from attachments, and I know how to append to an array, but I cannot get both to work in a single step. Could someone help me? I'd be grateful for any hint!


My current procedure is... suboptimal :see_no_evil: - but for completeness, here it is (a rough console sketch of the whole workflow follows the list):

  1. Create a temporary index where the data of an ingested attachment is stored
  • I have a pipeline with these processors for the attachments:
    "processors": [
        {
            "foreach": {
                "field": "Attachments",
                "processor": {
                    "attachment": {
                        "target_field": "_ingest._value.attachment",
                        "field": "_ingest._value.data"
                    }
                }
            }
        },
        {
            "foreach": {
                "field": "Attachments",
                "processor": {
                    "remove": {
                        "field": "_ingest._value.data"
                    }
                }
            }
        }
    ]
  • and post it to /tmpattachments/_doc/randomid?pipeline=Attachments
  2. Recover the extracted information
  • Just use a GET on the randomid generated in step 1.
  3. Send this information to Elasticsearch again, and this time use a script to append it to the array AttachedFiles, where it should go:
    "script": {
        "source": "ctx._source.AttachedFiles.addAll(params.new)"
        "params": {
            "new": [
                {
                    "Content": "extracted info",
                    "Filename": "file.pdf"
                }
            ]
        },
    }
  • post it to /finaldestination/_update/123
  4. Delete the temporary attachment object from step 1.
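
In Dev Tools console syntax, the whole round trip looks roughly like this (the index names, the pipeline name Attachments, and the script are taken from the steps above; randomid stands for the generated ID and the BASE64 content is shortened):

    # 1. Index the document through the pipeline into the temporary index
    POST /tmpattachments/_doc/randomid?pipeline=Attachments
    {
        "Attachments": [
            { "Filename": "file.pdf", "data": "BASE64..." }
        ]
    }

    # 2. Read the extracted information back
    GET /tmpattachments/_doc/randomid

    # 3. Append the extracted information to the target document's array
    POST /finaldestination/_update/123
    {
        "script": {
            "source": "ctx._source.AttachedFiles.addAll(params.new)",
            "params": {
                "new": [
                    { "Content": "extracted info", "Filename": "file.pdf" }
                ]
            }
        }
    }

    # 4. Delete the temporary document
    DELETE /tmpattachments/_doc/randomid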

I don't think you can easily "add" something to the array, as AFAIK an ingest pipeline cannot be invoked from an update document call.

The best option is to send the whole document again, which means encoding all the binaries as BASE64 once more and sending them over the wire.
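
Roughly like this, reusing the names from your example (a sketch only - note that the complete document, including the binaries that are already indexed, has to be in the request body again):

    # Overwrite the existing document, re-running the pipeline on
    # the full, re-encoded Attachments array
    PUT /finaldestination/_doc/123?pipeline=Attachments
    {
        "Attachments": [
            { "Filename": "existing.pdf", "data": "BASE64..." },
            { "Filename": "new.pdf", "data": "BASE64..." }
        ]
    }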

Another option would be not to index an array of binaries but each binary individually (that is, denormalizing the data).
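
A minimal sketch of what I mean - the pipeline name SingleAttachment and the DocumentId field that links an attachment back to its main document are made up for illustration:

    PUT _ingest/pipeline/SingleAttachment
    {
        "processors": [
            { "attachment": { "field": "data" } },
            { "remove": { "field": "data" } }
        ]
    }

    # One document per binary; adding a file later is just another
    # index request, no update of an existing array needed
    POST /attachments/_doc?pipeline=SingleAttachment
    {
        "DocumentId": "123",
        "Filename": "new.pdf",
        "data": "BASE64..."
    }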

Ok...thank you for the info. If that is the case, I might even keep my current setup.

As I see it, sending the complete data again would only be an improvement if those additions to the array happen fewer than four or five times (the exact numbers don't matter, the point is: not often).

Denormalization is something I've read about and it's definitely interesting, but I would first have to evaluate how much additional redundant data it produces, and find solutions for things like searching for a fixed number of results where each result should point to a different object, among other things. (That's not another question being asked - for now. :wink: )

Anyways, thanks!

Edit: If I can read your post as "it is almost certainly not possible", this thread can be closed; otherwise I would prefer to wait in case someone still has an idea... :slight_smile: (Things like using a script to read data out of one (temporary) object and add it to another directly within Elasticsearch are probably not possible either, are they?)

The only way I can see would be (sketched below):

  • Send the BASE64 content to the _simulate ingest endpoint.
  • Get back the result of the extraction
  • Send an "update" with this new data to be added to the existing array.

Thanks, I'll definitely try that out.
