Ingest: transforming multiple values in an array

Hey,

I'm looking for a way to transform

{
  "related": {
    "user": [
      "user1@domain",
      "user2@anotherdomain"
    ]
  }
}

into

{
  "related": {
    "user": [
      "user1@domain",
      "user1",
      "user2@anotherdomain",
      "user2"
    ]
  }
}

... using ingest processors. Basically, I have an array of User Principal Names (UPNs) and want to extend the field with the corresponding sAMAccountNames as well.

I've tried

  • Just Grok: does not work on array fields
  • Just Split: does not work on array fields
  • Foreach + Grok: the pattern %{DATA:_ingest.temp_user_name}@%{GREEDYDATA:_ingest.temp_user_domain} overwrites _ingest.temp_user_name on every iteration instead of appending, so at the end of the foreach only one value is left in _ingest.temp_user_name.
  • Foreach + Split without target_field (in-place): yields [["user1","domain"],["user2","anotherdomain"]], but that also overwrites the original data, and I wouldn't know how to collect this into just ["user1", "user2"].
  • Foreach + Split with target_field: also overwrites the target_field on every iteration and keeps only the value of the last iteration.

Does anyone have more ideas on how to solve this? Do I have to go with script processors?

Hi @nemhods,

Welcome to the Elastic community. Yes, I think you will have to go with a script processor, since this requirement is fairly custom.

Below is something that worked for me.

Create a Pipeline

PUT _ingest/pipeline/test-p1
{
  "description": "Convert email addresses to both full and username format",
  "processors": [
    {
      "script": {
        "source": """
          // Build the expanded list in a local variable instead of a temporary document field.
          def users = new ArrayList();
          for (def email : ctx.related.user) {
            // Everything before the first '@' is the bare username.
            def username = email.splitOnToken('@')[0];
            users.add(username);
            users.add(email);
          }
          ctx.related.user = users;
        """,
        "lang": "painless"
      }
    }
  ]
}
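
If you would rather check the result before indexing anything, the simulate API can run the pipeline against sample documents. This is only a minimal sketch: it assumes the pipeline above was stored as test-p1, and the sample addresses are the ones from the question.

```console
POST _ingest/pipeline/test-p1/_simulate
{
  "docs": [
    {
      "_source": {
        "related": {
          "user": [
            "user1@domain",
            "user2@anotherdomain"
          ]
        }
      }
    }
  ]
}
```

The response should contain the transformed document, with each bare username placed directly before its full address, since the script adds the username first.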

Index sample data

POST test-index1/_doc?pipeline=test-p1
{
  "related":{
    "user":[
      "test@domain.com",
      "test1@domain1.com",
      "test2@domain2.com"
    ]
  }
}

Output

GET test-index1/_search

Docs

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "test-index1",
        "_id": "KuIriosBa7YTodDiw2wW",
        "_score": 1,
        "_source": {
          "related": {
            "user": [
              "test",
              "test@domain.com",
              "test1",
              "test1@domain1.com",
              "test2",
              "test2@domain2.com"
            ]
          }
        }
      }
    ]
  }
}

Hey,

Awesome, thanks for already providing the script solution.
I've added a check so that the operation only runs on array members that contain an "@" sign.

def users = new ArrayList(); // local list instead of a temporary document field
for (def email : ctx.related.user) { // email/upn
    users.add(email);
    // Only entries containing "@" get an additional bare username.
    if (email.contains("@")) {
        users.add(email.splitOnToken('@')[0]);
    }
}
ctx.related.user = users;
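
For completeness, here is roughly how this could look as a full pipeline. Treat it as a sketch under a few assumptions rather than a tested setup: the pipeline name test-p2 is made up for illustration, the "if" condition is an extra guard (not part of the scripts above) that skips documents where related.user is missing or not an array, and the instanceof String check is likewise an added precaution against non-string entries.

```console
PUT _ingest/pipeline/test-p2
{
  "description": "Append the part before '@' for each entry in related.user",
  "processors": [
    {
      "script": {
        "if": "ctx.related?.user instanceof List",
        "lang": "painless",
        "source": """
          // Collect results in a local list, then replace the field once.
          def users = new ArrayList();
          for (def entry : ctx.related.user) {
            users.add(entry);
            // Only string entries containing '@' get a bare username appended.
            if (entry instanceof String && entry.contains('@')) {
              users.add(entry.splitOnToken('@')[0]);
            }
          }
          ctx.related.user = users;
        """
      }
    }
  ]
}
```

With the guard in place, documents without a related.user array should pass through the pipeline unchanged instead of failing in the script.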