Resolved - Logstash - CSV parse failure for some records

Hi Team,

I have a huge CSV file exported from GitHub containing commit information for our engineers. Two columns in the CSV (COMMIT_COMMENT and JIRA_KEY) may contain special characters, but unfortunately I cannot know in advance what or how many special characters the file will contain, since the content is filled in manually by engineers in the JIRA system and cannot be controlled. Theoretically, they can input any special characters or even HTML tags.

1. A sample piece of the CSV file that contains special characters:

> "GIT_ORG","GIT_REPOS","COMMIT_SHA1","COMMIT_AUTHOR","COMMIT_DATE","COMMIT_COMMENT","JIRA_KEY","JIRA_ISSUE_TYPE","JIRA_ISSUE_CATEGORY","JIRA_PRIORITY","JIRA_COMPONENTS","JIRA_DEFECT_TYPE","JIRA_TESTCASE_ID","JIRA_FIND_BY_AUTO"
>  "bizx","au-recruiting","d88e03469576f30079c366850fc952864815fb65","Kaderjan Ilghar","2018-01-24","fixing SML-710\, SML-714\,SML-708\,SML-709","","","","",,,,
> "bizx","au-V4","f54abe7a0516407a8f6e2334ce8b1a1c229dd6aa","wwang","2016-09-19","TFT- DTCE new common component change ( Change FeatureEnum type)","","","","",,,,
> "bizx","au-V4","da5ab4deecd598e514bc3848d1dd0204643ccc0a","I848428","2017-03-01","PLT-56858: Remove the \"field-permissions\" level message key for field-permissions.","PLT-56858","Defect","Internal Issue","3-Medium","|Z. DO NOT USE Permission|","New","","No"

2. My Logstash configuration:

    input {
      file {
        path => "C:/elkstack/elasticsearch-6.5.1/logs/git_commits.csv"
        start_position => "beginning"
        sincedb_path => "NUL"
      }
    }
    filter {
      csv {
        columns => [ "GIT_ORG", "GIT_REPOS", "COMMIT_SHA1", "COMMIT_AUTHOR",
                     "COMMIT_DATE", "COMMIT_COMMENT", "JIRA_KEY", "JIRA_ISSUE_TYPE",
                     "JIRA_ISSUE_CATEGORY", "JIRA_PRIORITY", "JIRA_COMPONENTS",
                     "JIRA_DEFECT_TYPE", "JIRA_TESTCASE_ID", "JIRA_FIND_BY_AUTO" ]
        separator => ","
      }
      # enrich each commit with file details from a lookup CSV keyed by COMMIT_SHA1
      translate {
        destination => "[@metadata][lookup]"
        dictionary_path => "C:/elkstack/elasticsearch-6.5.1/logs/LookUp.csv"
        field => "COMMIT_SHA1"
      }
      # split the looked-up value into its three semicolon-separated parts
      dissect {
        mapping => { "[@metadata][lookup]" => "%{FILE_CHANGED};%{FILE_TYPE};%{FILE_METHOD}" }
      }
    }
    output {
      elasticsearch {
        action => "index"
        hosts  => "localhost:9200"
        index  => "logstash-git"
      }
      stdout { codec => rubydebug }
    }

3. Logstash trace log:

    {
              "tags" => [
            [0] "_csvparsefailure"
        ],
              "host" => "PVGN50859047A",
          "@version" => "1",
        "@timestamp" => 2019-03-05T09:16:47.430Z,
           "message" => "\"bizx\",\"au-V4\",\"da5ab4deecd598e514bc3848d1dd0204643ccc0a\",\"I848428\",\"2017-03-01\",\"PLT-56858: Remove the \\\"field-permissions\\\" level message key for field-permissions.\",\"PLT-56858\",\"Defect\",\"Internal Issue\",\"3-Medium\",\"|Z. DO NOT USE Permission|\",\"New\",\"\",\"No\"\r",
              "path" => "C:/elkstack/elasticsearch-6.5.1/logs/less.csv"
    }
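
From the trace it looks like the backslash-escaped quote (\") inside COMMIT_COMMENT is what breaks the parse: as far as I can tell, the csv filter relies on Ruby's CSV parser, which expects a quote inside a quoted field to be doubled ("") rather than backslash-escaped. One idea might be to normalize those escapes with mutate/gsub before the csv filter runs (an untested sketch):

    filter {
      mutate {
        # untested sketch: rewrite \" as "" so the quoting style matches
        # what Ruby's CSV parser expects
        gsub => [ "message", '\\"', '""' ]
      }
      mutate {
        # cosmetic: \, inside quotes parses fine, but this drops the stray
        # backslash from values such as "fixing SML-710\, SML-714"
        gsub => [ "message", '\\,', ',' ]
      }
      # ...csv filter as above...
    }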

Please advise how I should get rid of this issue. The precondition is that I cannot know how many special characters will be in the CSV. Is there a general way to resolve this?

Hi,
If your file contains special characters, my best suggestion is that instead of using the csv filter you extract the fields with a grok filter, to avoid that kind of issue.

Hi Ganesh - how would I do that with a grok filter?

The following will process the first 5 and the last 7 fields, leaving you with 2 fields in [message]. I'm not sure how to parse those 2.

    if [message] =~ /^"GIT_ORG"/ { drop {} }

    #Strip and process 5 well formatted fields at the beginning
    grok { match => { "message" => '^(?<first5>("[^"]*"|),("[^"]*"|),("[^"]*"|),("[^"]*"|),("[^"]*"|)),' } }
    mutate { gsub => [ "message", '^("[^"]*"|),("[^"]*"|),("[^"]*"|),("[^"]*"|),("[^"]*"|),', "" ] }
    csv {
        source => "first5"
        columns => [ "GIT_ORG", "GIT_REPOS", "COMMIT_SHA1", "COMMIT_AUTHOR", "COMMIT_DATE"]
        remove_field => [ "first5" ]
    }

    #Strip and process 7 well formatted fields at the end
    grok { match => { "message" => ',(?<last7>("[^"]*"|),("[^"]*"|),("[^"]*"|),("[^"]*"|),("[^"]*"|),("[^"]*"|),("[^"]*"|))$' } }
    mutate { gsub => [ "message",           ',("[^"]*"|),("[^"]*"|),("[^"]*"|),("[^"]*"|),("[^"]*"|),("[^"]*"|),("[^"]*"|)$', "" ] }
    csv {
        source => "last7"
        columns => [ "JIRA_ISSUE_TYPE", "JIRA_ISSUE_CATEGORY", "JIRA_PRIORITY", "JIRA_COMPONENTS", "JIRA_DEFECT_TYPE", "JIRA_TESTCASE_ID", "JIRA_FIND_BY_AUTO" ]
        remove_field => [ "last7" ]
    }

It may require some adhocery, but for those 3 example lines this would work:

grok { match => { "message" => '^"%{DATA:COMMIT_COMMENT}","%{DATA:JIRA_KEY}"$' } }
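
Putting those pieces together, the filter section would run in roughly this order (a sketch; the numbered comments stand in for the grok/mutate/csv blocks shown above):

    filter {
      # skip the header row
      if [message] =~ /^"GIT_ORG"/ { drop {} }

      # 1. grok + mutate + csv for the first 5 well-formatted fields
      # 2. grok + mutate + csv for the last 7 well-formatted fields

      # 3. what remains in [message] is the two free-text fields
      grok { match => { "message" => '^"%{DATA:COMMIT_COMMENT}","%{DATA:JIRA_KEY}"$' } }
    }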

Try this pattern; it will work, but you need a value in every column after each comma.

"%{WORD:data1}","?(?[a-zA-Z0-9-]+)","%{WORD:data3}","%{WORD:data4}","?(?[a-zA-Z0-9-]+)","?(?[A-Za-z0-9:\" .-]+)","?(?[a-zA-Z0-9-]+)","?(?[a-zA-Z0-9-]+)","?(?[a-zA-Z0-9 -]+)","?(?[a-zA-Z0-9-]+)","?(?[a-zA-Z0-9 .|-]+)","?(?[a-zA-Z0-9-]+)","?(?[a-zA-Z0-9 -]+)","?(?[a-zA-Z0-9-]+)


Thank you Badger, helpful suggestion.

Thank you Ganesh, that gives me another approach.
