Strip array off of NDJSON data set using Elasticsearch pipeline

Hello,

I have data being ingested into Elasticsearch (currently version 7.3), sent to it from Filebeat (7.3). The logs are in NDJSON format.

A section of the data has the format {"tagset": {"username": {"domain\\username": []}}}. The array is empty in all cases.

("domain" and "username" being the actual domain\username of the user in the domain. Meaning, the "username" in this sense is always different for every log entry.)

The "domain\username" is really the value of the "username" key. But it is being treated as an array in this dataset and is thus being indexed wrongly in elasticsearch.

I am trying to strip the array off of the "domain\username" and make "domain\username" the value of "username" (or the value of another field that can be created).
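In other words, given a log entry like:

{"tagset": {"username": {"DOMAIN\\joe": []}}}

I would like the indexed document to end up with something like:

{"user_name": "DOMAIN\\joe"}

(The exact target field name doesn't matter to me.)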

Currently I am not using Logstash, but am instead trying to handle this with an Elasticsearch ingest pipeline (though using Logstash is not totally out of the question). I have tried grok and other methods to no avail (i.e., I am probably doing something wrong).

Thanks in advance for any assistance.

Hi Nik,

Can you share the snippet for the grok processor that you've tried?

Alternatively, have you tried using the set processor to update the field with the value you want? Or, if you already have the value in a different field, you could remove this field with the remove processor and then use the other field.
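For example, a minimal sketch of that second approach (the source field name here is just a placeholder for wherever the value actually lives in your documents):

PUT _ingest/pipeline/my_pipeline
{
  "processors": [
    {
      "set": {
        "field": "user_name",
        "value": "{{ some_existing_field }}"
      }
    },
    {
      "remove": {
        "field": "tagset.username"
      }
    }
  ]
}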

Hi Carly,

Thanks for taking this up. This is what I tried:

PUT _ingest/pipeline/my_pipeline
{
  "processors": [
    {
      "grok": {
        "field": "tagset.username.DOMAIN\\%{USERNAME:username}$",
        "patterns": ["^.*?\\\\%{USERNAME:username}$"]
      }
    },
    {
      "remove": {
        "field": "tagset.username.DOMAIN\\*"
      }
    },
    {
      "set": {
        "field": "user_name",
        "value": "{{ username }}"
      }
    }
  ]
}

As explained in the earlier post, the data is structured like so:

{"tagset": { "username": { "DOMAIN\\joe": [] }}}

(There are other objects within the "tagset" object, but I think this gives the idea.)

I'm guessing there is probably something wrong with the regex I'm using in the "patterns" field in my grok processor.
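For what it's worth, I have been testing against a sample document with the simulate API, which at least shows me the processor errors without having to re-ingest anything:

POST _ingest/pipeline/my_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "tagset": { "username": { "DOMAIN\\joe": [] } }
      }
    }
  ]
}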

Thanks again for taking this up.

UPDATE:

The pipeline below "works", but it produces a lot of bulk load errors in Filebeat. You'll see that I changed a few things around in the grok processor and the pattern. I am also converting the tagset.username field to a string first.

PUT _ingest/pipeline/my_pipeline
{
  "description": "",
  "processors": [
    {
      "convert": {
        "field": "tagset.username",
        "type": "string"
      }
    },
    {
      "grok": {
        "field": "tagset.username",
        "patterns": [
          """^.*?DOMAIN\\%{WORD:username}.*?"""
        ]
      }
    },
    {
      "remove": {
        "field": "tagset.username",
        "ignore_failure": true
      }
    },
    {
      "set": {
        "field": "user_name",
        "value": "{{ username }}"
      }
    }
  ]
}

It is not really a solution as the pipeline is not ingesting all of the log data.

Appreciate the update. Can you share the precise error you're getting with Filebeat?
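One guess in the meantime: if some documents don't contain tagset.username at all (or the grok pattern doesn't match), the grok processor fails and the whole document is rejected, which would surface as bulk errors in Filebeat. You could try ignore_missing and/or ignore_failure on the grok processor, along these lines:

{
  "grok": {
    "field": "tagset.username",
    "patterns": [
      """^.*?DOMAIN\\%{WORD:username}.*?"""
    ],
    "ignore_missing": true,
    "ignore_failure": true
  }
}

(Note that ignore_failure just skips the processor when it fails rather than dropping the document, so it hides the failures instead of fixing them.)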

The following solved the issue. It seems messy and may not be the best way of going about it, but it works:

PUT _ingest/pipeline/my_pipeline
{
  "description": "",
  "processors": [
    {
      "set": {
        "field": "user_name",
        "value": "{{ tagset.username }}"
      }
    },
    {
      "remove": {
        "field": "tagset.username",
        "ignore_failure": true
      }
    },
    {
      "convert": {
        "field": "user_name",
        "type": "string"
      }
    },
    {
      "gsub": {
        "field": "user_name",
        "pattern": """\{DOMAIN\\\\""",
        "replacement": ""
      }
    },
    {
      "gsub": {
        "field": "user_name",
        "pattern": "=\\[\\]\\}",
        "replacement": ""
      }
    }
  ]
}

This produces a value that is just the username minus the domain name, slashes and brackets.
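For anyone following along, the pipeline can be checked against a sample document with the simulate API:

POST _ingest/pipeline/my_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "tagset": { "username": { "DOMAIN\\joe": [] } }
      }
    }
  ]
}

For the sample above, the resulting document should come back with "user_name": "joe".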
