Parse a comma-separated value that contains a comma in it

Evening All,

I am trying to parse a log entry that is comma separated, but when the SOMEUSER field is present it contains a comma of its own that should be ignored.

The fields that exist in some fashion are SOMEUSER, SOMENETWORK, SOMECOMPUTER

Examples of the format of the logs are:

"Smith, John A. - Some Business Title (Smi,SOMENETWORK,SOMECOMPUTER"
"Smith, John A. - Some Business Title (Smi,SOMECOMPUTER"
"SOMECOMPUTER,SOMENETWORK"
"SOMECOMPUTER"
"SOMENETWORK"

I do have an identity_type field that tells me which kinds of identities are in the CSV list that I need to split out, but I am not sure how to skip the first comma if the AD User identity field is present.

Formats I have seen in the logs:
SOMEUSER,SOMECOMPUTER
SOMEUSER,SOMENETWORK,SOMECOMPUTER
SOMECOMPUTER,SOMENETWORK
SOMEUSER,SOMENETWORK
SOMENETWORK
SOMECOMPUTER
SOMEUSER

I am looking to see if there is a way to skip the first comma if the SOMEUSER value exists.

My thoughts were to use an if statement to process things depending on what identity types are in the appropriate field, which works for everything except when the SOMEUSER field with the comma in it messes it all up.

Instead of trying to write the parsing code yourself, perhaps the csv filter plugin (Logstash Reference [7.11]) would be a good fit for your needs?

That was the first thing I had used and it does not skip the first comma when the username is there. That is what I am trying to figure out how to handle.

Smith, John would be taken as the first and second fields instead of just the first field.

Oh - the CSV being parsed is badly formed. Can you have the upstream source fix that?

If not, you could probably do the equivalent of an rsplit with a split limit of 3, except for e.g. this case that messes that idea up:

Smith, John A. - Some Business Title (Smi,SOMECOMPUTER

You need a way you can programmatically differentiate that from:

SOMEUSER,SOMENETWORK,SOMECOMPUTER
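
To make the rsplit problem concrete, here is a small Python sketch (not Logstash config). The helper name `rsplit_fields` is made up for illustration; it splits from the right into at most three fields. It works on the three-field sample, but on the two-field sample the extra split lands on the comma inside the user name.

```python
def rsplit_fields(line: str, max_fields: int = 3) -> list[str]:
    # Hypothetical helper: split from the right into at most `max_fields` pieces.
    # str.rsplit takes the number of splits, which is one less than the pieces.
    return line.rsplit(",", max_fields - 1)


three = rsplit_fields("Smith, John A. - Some Business Title (Smi,SOMENETWORK,SOMECOMPUTER")
two = rsplit_fields("Smith, John A. - Some Business Title (Smi,SOMECOMPUTER")

# The user name survives intact only when two real separators follow it.
print(three)
# With a single real separator, the comma inside the name is consumed as well.
print(two)
```
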

If you have rules such as:

  • user fields may contain spaces
  • computers only contain uppercase letters
  • networks may contain lowercase letters

then you're in luck, you could do something like this:

grok rules:

SOMEUSER %{DATA}
SOMECOMPUTER [A-Z]+
SOMENETWORK [a-z]+

grok capture:

^%{SOMEUSER:someuser}(?:,%{SOMECOMPUTER:somecomputer})?(?:,%{SOMENETWORK:somenetwork})?$

would give you, for example:

{
  "somecomputer": "SOMECOMPUTER",
  "someuser": "Smith, John A. - Some Business Title (Smi"
}
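
If you want to try that pattern outside Logstash first, the following Python sketch mirrors it with the standard `re` module. It assumes grok's DATA expands to `.*?` and that Python's regex engine behaves like grok's for constructs this simple; the `parse` function and `LINE_RE` name are made up for illustration.

```python
import re

# Same sub-patterns as the grok rules above; DATA in grok is the lazy `.*?`.
SOMEUSER = r".*?"
SOMECOMPUTER = r"[A-Z]+"
SOMENETWORK = r"[a-z]+"

LINE_RE = re.compile(
    rf"^(?P<someuser>{SOMEUSER})"
    rf"(?:,(?P<somecomputer>{SOMECOMPUTER}))?"
    rf"(?:,(?P<somenetwork>{SOMENETWORK}))?$"
)


def parse(line: str) -> dict[str, str]:
    # Hypothetical helper: return only the named groups that actually matched.
    match = LINE_RE.match(line)
    if match is None:
        return {}
    return {name: value for name, value in match.groupdict().items() if value is not None}


print(parse("Smith, John A. - Some Business Title (Smi,SOMECOMPUTER"))
print(parse("SOMECOMPUTER"))
```

One thing this makes visible: because the user pattern is mandatory and the optional groups each need a leading comma, a line holding only a computer name ends up in someuser. You would still need the identity_type check (or an optional user group) to tell those cases apart.
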

I have a feature request in to fix the upstream data, but I am not holding my breath.

User names are mixed character with spaces.
Computer names are in uppercase, but also contain numbers. Something like a Dell Service Tag.
Network names mix upper- and lowercase letters and contain spaces.

I'll see if I can figure it out with grok, was just hoping for some awesome command I had not been able to find that says skip the first comma.

Instead of skipping the first comma, capture the two halves as separate fields (user part 1, user part 2) and then combine them?

Or, if it's always there, use:

SOMEUSER [^,]+, [^,]+

If not, perhaps:

SOMEUSER [^,]+(?:, [^,]+)?
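
As a rough way to see what that optional form does, here is a Python sketch using the same SOMEUSER expression. It assumes the comma inside a user name is always followed by a space and that real separators never are; that holds for the samples in this thread but is not guaranteed by anything else. `split_fields` and `FIRST_FIELD_RE` are made-up names.

```python
import re

# Same expression as the optional SOMEUSER rule above.
SOMEUSER = r"[^,]+(?:, [^,]+)?"

# First field (possibly "Last, First ..."), then whatever follows the next comma.
FIRST_FIELD_RE = re.compile(rf"^(?P<first>{SOMEUSER})(?:,(?P<rest>.*))?$")


def split_fields(line: str) -> list[str]:
    # Hypothetical helper: keep the first field whole, split the remainder normally.
    match = FIRST_FIELD_RE.match(line)
    if match is None:
        return []
    fields = [match.group("first")]
    if match.group("rest") is not None:
        fields.extend(match.group("rest").split(","))
    return fields


print(split_fields("Smith, John A. - Some Business Title (Smi,SOMENETWORK,SOMECOMPUTER"))
print(split_fields("SOMECOMPUTER,SOMENETWORK"))
```

This only separates the fields; deciding which one is the user, network, or computer would still come from identity_type.
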

I'd suggest making a corpus of your log lines and test cases for your regexes.
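
A corpus like that can be as small as a list of (line, expected fields) pairs. The sketch below uses the sample lines from this thread; `candidate_split` is a placeholder for whatever splitting rule you are evaluating (here, "split on commas not followed by a space", which is an assumption about the data), and `run_corpus` is a made-up name.

```python
import re

USER = "Smith, John A. - Some Business Title (Smi"

# Sample lines from the thread, paired with the fields we expect back.
CORPUS = [
    (f"{USER},SOMENETWORK,SOMECOMPUTER", [USER, "SOMENETWORK", "SOMECOMPUTER"]),
    (f"{USER},SOMECOMPUTER", [USER, "SOMECOMPUTER"]),
    ("SOMECOMPUTER,SOMENETWORK", ["SOMECOMPUTER", "SOMENETWORK"]),
    ("SOMECOMPUTER", ["SOMECOMPUTER"]),
    ("SOMENETWORK", ["SOMENETWORK"]),
]


def candidate_split(line: str) -> list[str]:
    # Placeholder rule under test: a separator is a comma NOT followed by a space.
    return re.split(r",(?! )", line)


def run_corpus(split) -> list[tuple[str, list[str], list[str]]]:
    # Return (line, expected, actual) for every corpus entry the splitter gets wrong.
    failures = []
    for line, expected in CORPUS:
        actual = split(line)
        if actual != expected:
            failures.append((line, expected, actual))
    return failures


failures = run_corpus(candidate_split)
for line, expected, actual in failures:
    print(f"FAIL {line!r}: expected {expected}, got {actual}")
print(f"{len(CORPUS) - len(failures)} of {len(CORPUS)} corpus lines parsed as expected")
```

Add every odd line you find in the real logs to the list; a rule that looks fine on five samples can fall over on the sixth.
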
