How to handle comma delimited message with comma's existing within quotation marks


(Andrew) #1

Hi All,

Working on yet another curly parsing issue which I'm seeking community support as I have not readily been able to derive a graceful solution.

Issue: Ingested message is in the format "string A,string B,string C,string D" with finite number of fields that stay the same. Some logs do not populate all fields (they stay blank) but the order in which the fields present stay the same.

The issue I have is with logs following this format;

  • message1 => "field 1, field 2, field 3, field 4, "field, 5", field 6, field 7.
  • message2 => "field 1, field 2, field 3, field 4, field 5, field 6, field 7.
  • message3 => "field 1, field 2, field 3, field 4, "field,5,is,present,here", field 6 ....

Notes: Field 5 is a field that supports multiple formats, based on an upstream client system in which we have no control over. However, without going to the extreme of trying to determine all characters supported by mail clients and which could be populated in field 5, I thought there may be some grok or grok + regex combination that would allow me to find all "commas" between the double quotations and transform those prior to the GROK filter parsing out the fields.

The issue is that the grok pattern I have to handle this works fine for all logs that do not have double quotations and "commas" in them. Example of error:

  • message 1 is parsed, fields 1 through 4 populate successfully, field 5 then becomes "field and field 6 becomes 5" and then all fields get displaced after. This is not ideal. For the most part the grok pattern works until I hit this specific use case and unsure how to handle it.

Real log which has issues:
2017-05-02T23:59:59.974Z,,ex2k10,,ex2k10,08D48EEB952EDEF0;2017-05-02T23:59:59.846Z;0,,STOREDRIVER,DELIVER,62,0a.09hh.2dk.pj08k8fohb.0@wingels.netline.com,testuser@blah.net,TradePub,26189,1,,,"Bob, you have 3 new eBooks available now.",Offers@Tradepubs.nl00.net,Offers@Tradepubs.nl00.net,2017-05-02T23:59:51.113Z;SRV=ex2k10.blah.net:TOTAL=8|SMR=8,Incoming,,,,S:MailboxDatabaseName=mailbox database 1323629167;S:DatabaseHealth=-1

Real log which parse fine:
2017-05-10T07:06:19.229Z,,ex2k10,,ex2k10,08D494F47E75FBED;2017-05-10T07:06:19.194Z;0,,STOREDRIVER,DELIVER,57,005501d2c95c$79d48230$6d7d8690$@test2.com,testuser@blah.com,,17240,1,,,RE: VMWare License,targetuser@test2.com,targetuser@test2.com,2017-05-10T07:06:08.326Z;SRV=ex2k10.blah.com:TOTAL=10|SMR=10,Incoming,,,,S:MailboxDatabaseName=mailbox database 1362874279;S:DatabaseHealth=-1

Grok Pattern (Attempt with Regex Match)

grok {
match => { "message" => "(%{TIMESTAMP_ISO8601:date-time})?,(%{IPORHOST:client-ip})?,(%{IPORHOST:client-hostname})?,(%{IPORHOST:server-ip})?,(%{IPORHOST:server-hostname})?,(%{GREEDYDATA:source-context})?,(%{GREEDYDATA:connector-id})?,(%{WORD:source})?,(%{WORD:event-id})?,(%{NUMBER:internal-message-id})?,(%{GREEDYDATA:message-id})?,(%{GREEDYDATA:recipient-address})?,(%{GREEDYDATA:recipient-status})?,(%{NUMBER:total-bytes})?,(%{NUMBER:recipient-count})?,(%{GREEDYDATA:related-recipient-address})?,(%{DATA:reference})?,(?(?<=,)(.*?)(?=,))?,(%{GREEDYDATA:sender-address})?,(%{GREEDYDATA:return-path})?,(%{GREEDYDATA:message-info})?,(%{WORD:directionality})?,(%{GREEDYDATA:tenant-id})?,(%{IPORHOST:original-client-ip})?,(%{IPORHOST:original-server-ip})?,(%{GREEDYDATA:custom-data})?" }
}

Grok Pattern (Attempt without Regex):

grok {
match => { "original_message" => "(%{TIMESTAMP_ISO8601:date-time})?,(%{IPORHOST:client-ip})?,(%{IPORHOST:client-hostname})?,(%{IPORHOST:server-ip})?,(%{IPORHOST:server-hostname})?,(%{GREEDYDATA:source-context})?,(%{GREEDYDATA:connector-id})?,(%{WORD:source})?,(%{WORD:event-id})?,(%{NUMBER:internal-message-id})?,(%{DATA:message-id})?,(%{GREEDYDATA:recipient-address})?,(%{GREEDYDATA:recipient-status})?,(%{NUMBER:total-bytes})?,(%{NUMBER:recipient-count})?,(%{IPORHOST:related-recipient-address})?,(%{DATA:reference})?,(%{DATA:message-subject})?,(%{GREEDYDATA:sender-address})?,(%{GREEDYDATA:return-path})?,(%{GREEDYDATA:message-info})?,(%{WORD:directionality})?,(%{DATA:tenant-id})?,(%{IPORHOST:original-client-ip})?,(%{IPORHOST:original-server-ip})?,(%{GREEDYDATA:custom-data})?" }
}

I also attempted a mutate on the original message which would create a new message field with an attempt to find any "comma" matches between quotations and then perform the grok on the updated original message with the modification. This would have worked well assuming it only matched the comma and not all text and allowed me to change the "comma" to something else prior to running the grok filter across it. Ideally, if I could change it to not match (.*?) but to match all "commas" between the quotations, this would handle the use case where this specific field may contain multiple comma's which would all need to be modified in advance. It would also possible allow me to perform the modification across other fields which may also contain "commas". Given it a reasonably good crack but I feel I've started to over cook the solution and suspect there's a much easier way to go about it.

Please let me know, stuck!

Thanks,
Andrew


(system) #2

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.