Strip array off of NDJSON data set using Elasticsearch pipeline

Hello,

I have data being ingested into Elasticsearch (currently version 7.3), sent to it from Filebeat (7.3). The logs are in NDJSON format.

A section of the data has the format {"tagset": {"username": {"domain\\username": []}}}. The array is empty in all cases.

("domain" and "username" being the actual domain\username of the user in the domain. Meaning, the "username" in this sense is always different for every log entry.)

The "domain\username" is really the value of the "username" key. But it is being treated as an array in this dataset and is thus being indexed wrongly in elasticsearch.

I am trying to strip the array off of the "domain\username" and make "domain\username" the value of "username" (or the value of another field that can be created).
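In other words, given a log entry like:

{"tagset": {"username": {"DOMAIN\\joe": []}}}

I would like the indexed document to end up with something like:

{"user_name": "DOMAIN\\joe"}

(The exact target field name doesn't matter to me.)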

Currently I am not using Logstash, but am instead trying to handle this with an Elasticsearch ingest pipeline (though using Logstash is not totally out of the question). I have tried grok and other methods to no avail (i.e., I am probably doing something wrong).

Thanks in advance for any assistance.

Hi Nik,

Can you share the snippet for the grok processor that you've tried?

Alternatively, have you tried using the set processor to update the field with the value you want? Or, if you already have the value in a different field, you could remove this field with the remove processor and then use the other field.
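For example, a minimal sketch of that second approach (the source field name here is just a placeholder for wherever the value actually lives in your documents):

PUT _ingest/pipeline/my_pipeline
{
  "processors": [
    {
      "set": {
        "field": "user_name",
        "value": "{{ some_existing_field }}"
      }
    },
    {
      "remove": {
        "field": "tagset.username"
      }
    }
  ]
}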

Hi Carly,

Thanks for taking this up. This is what I tried:

PUT _ingest/pipeline/my_pipeline
{
  "processors": [
    {
      "grok": {
        "field": "tagset.username.DOMAIN\\%{USERNAME:username}$",
        "patterns": ["^.*?\\\\%{USERNAME:username}$"]
      }
    },
    {
      "remove": {
        "field": "tagset.username.DOMAIN\\*"
      }
    },
    {
      "set": {
        "field": "user_name",
        "value": "{{ username }}"
      }
    }
  ]
}

As explained in the earlier post, the data is structured like so:

{"tagset": { "username": { "DOMAIN\\joe": [] }}}

(There are other objects within the "tagset" object, but I think this gives the idea.)

I'm guessing there is probably something wrong with the regex I'm using in the "patterns" field in my grok processor.
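For what it's worth, I have been testing against a sample document with the simulate API, which at least shows me the processor errors without having to re-ingest anything:

POST _ingest/pipeline/my_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "tagset": { "username": { "DOMAIN\\joe": [] } }
      }
    }
  ]
}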

Thanks again for taking this up.

UPDATE:

The pipeline below "works", but it produces a lot of bulk load errors in Filebeat. You'll see that I changed a few things around in the grok processor and the pattern. I am also converting the tagset.username field to a string first.

PUT _ingest/pipeline/my_pipeline
{
  "description": "",
  "processors": [
    {
      "convert": {
        "field": "tagset.username",
        "type": "string"
      }
    },
    {
      "grok": {
        "field": "tagset.username",
        "patterns": [
          """^.*?DOMAIN\\%{WORD:username}.*?"""
        ]
      }
    },
    {
      "remove": {
        "field": "tagset.username",
        "ignore_failure": true
      }
    },
    {
      "set": {
        "field": "user_name",
        "value": "{{ username }}"
      }
    }
  ]
}

It is not really a solution as the pipeline is not ingesting all of the log data.

Appreciate the update. Can you share the precise error you're getting with Filebeat?
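One guess in the meantime: if some documents don't contain tagset.username at all (or the grok pattern doesn't match), the grok processor fails and the whole document is rejected, which would surface as bulk errors in Filebeat. You could try ignore_missing and/or ignore_failure on the grok processor, along these lines:

{
  "grok": {
    "field": "tagset.username",
    "patterns": [
      """^.*?DOMAIN\\%{WORD:username}.*?"""
    ],
    "ignore_missing": true,
    "ignore_failure": true
  }
}

(Note that ignore_failure just skips the processor when it fails rather than dropping the document, so it hides the failures instead of fixing them.)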

The following solved the issue. It seems messy and may not be the best way of going about it, but it works:

PUT _ingest/pipeline/my_pipeline
{
  "description": "",
  "processors": [
    {
      "set": {
        "field": "user_name",
        "value": "{{ tagset.username }}"
      }
    },
    {
      "remove": {
        "field": "tagset.username",
        "ignore_failure": true
      }
    },
    {
      "convert": {
        "field": "user_name",
        "type": "string"
      }
    },
    {
      "gsub": {
        "field": "user_name",
        "pattern": """\{DOMAIN\\\\""",
        "replacement": ""
      }
    },
    {
      "gsub": {
        "field": "user_name",
        "pattern": "=\\[\\]\\}",
        "replacement": ""
      }
    }
  ]
}

This produces a value that is just the username minus the domain name, slashes and brackets.
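For anyone following along, the pipeline can be checked against a sample document with the simulate API:

POST _ingest/pipeline/my_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "tagset": { "username": { "DOMAIN\\joe": [] } }
      }
    }
  ]
}

For the sample above, the resulting document should come back with "user_name": "joe".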
