Resolved - Logstash - CSV parse failure for some records

Hi Team,

I have a huge CSV file exported from GitHub containing commit information for our engineers. Two columns in the CSV (COMMIT_COMMENT and JIRA_KEY) may contain special characters, but unfortunately I cannot know in advance what or how many special characters the file will contain, since the content is filled in manually by engineers in the JIRA system and cannot be controlled. Theoretically, they can input any special characters or even HTML tags.

1. A sample piece of the CSV file that contains special characters:

> "GIT_ORG","GIT_REPOS","COMMIT_SHA1","COMMIT_AUTHOR","COMMIT_DATE","COMMIT_COMMENT","JIRA_KEY","JIRA_ISSUE_TYPE","JIRA_ISSUE_CATEGORY","JIRA_PRIORITY","JIRA_COMPONENTS","JIRA_DEFECT_TYPE","JIRA_TESTCASE_ID","JIRA_FIND_BY_AUTO"
>  "bizx","au-recruiting","d88e03469576f30079c366850fc952864815fb65","Kaderjan Ilghar","2018-01-24","fixing SML-710\, SML-714\,SML-708\,SML-709","","","","",,,,
> "bizx","au-V4","f54abe7a0516407a8f6e2334ce8b1a1c229dd6aa","wwang","2016-09-19","TFT- DTCE new common component change ( Change FeatureEnum type)","","","","",,,,
> "bizx","au-V4","da5ab4deecd598e514bc3848d1dd0204643ccc0a","I848428","2017-03-01","PLT-56858: Remove the \"field-permissions\" level message key for field-permissions.","PLT-56858","Defect","Internal Issue","3-Medium","|Z. DO NOT USE Permission|","New","","No"

2. My Logstash configuration:

    input {
      file {
        path => "C:/elkstack/elasticsearch-6.5.1/logs/git_commits.csv"
        start_position => "beginning"
        sincedb_path => "NUL"
      }
    }
    filter {
      csv {
        columns => [ "GIT_ORG", "GIT_REPOS", "COMMIT_SHA1", "COMMIT_AUTHOR",
                     "COMMIT_DATE", "COMMIT_COMMENT", "JIRA_KEY", "JIRA_ISSUE_TYPE",
                     "JIRA_ISSUE_CATEGORY", "JIRA_PRIORITY", "JIRA_COMPONENTS",
                     "JIRA_DEFECT_TYPE", "JIRA_TESTCASE_ID", "JIRA_FIND_BY_AUTO" ]
        separator => ","
      }
      # enrich each commit with file details from a lookup CSV keyed by COMMIT_SHA1
      translate {
        destination => "[@metadata][lookup]"
        dictionary_path => "C:/elkstack/elasticsearch-6.5.1/logs/LookUp.csv"
        field => "COMMIT_SHA1"
      }
      # split the looked-up value into its three semicolon-separated parts
      dissect {
        mapping => { "[@metadata][lookup]" => "%{FILE_CHANGED};%{FILE_TYPE};%{FILE_METHOD}" }
      }
    }
    output {
      elasticsearch {
        action => "index"
        hosts  => "localhost:9200"
        index  => "logstash-git"
      }
      stdout { codec => rubydebug }
    }

3. Logstash trace log:

    {
              "tags" => [
            [0] "_csvparsefailure"
        ],
              "host" => "PVGN50859047A",
          "@version" => "1",
        "@timestamp" => 2019-03-05T09:16:47.430Z,
           "message" => "\"bizx\",\"au-V4\",\"da5ab4deecd598e514bc3848d1dd0204643ccc0a\",\"I848428\",\"2017-03-01\",\"PLT-56858: Remove the \\\"field-permissions\\\" level message key for field-permissions.\",\"PLT-56858\",\"Defect\",\"Internal Issue\",\"3-Medium\",\"|Z. DO NOT USE Permission|\",\"New\",\"\",\"No\"\r",
              "path" => "C:/elkstack/elasticsearch-6.5.1/logs/less.csv"
    }
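
From the trace it looks like the backslash-escaped quote (\") inside COMMIT_COMMENT is what breaks the parse: as far as I can tell, the csv filter relies on Ruby's CSV parser, which expects a quote inside a quoted field to be doubled ("") rather than backslash-escaped. One idea might be to normalize those escapes with mutate/gsub before the csv filter runs (an untested sketch):

    filter {
      mutate {
        # untested sketch: rewrite \" as "" so the quoting style matches
        # what Ruby's CSV parser expects
        gsub => [ "message", '\\"', '""' ]
      }
      mutate {
        # cosmetic: \, inside quotes parses fine, but this drops the stray
        # backslash from values such as "fixing SML-710\, SML-714"
        gsub => [ "message", '\\,', ',' ]
      }
      # ...csv filter as above...
    }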

Please advise how I should get rid of this issue. The precondition is that I cannot know how many special characters will be in the CSV. Is there a general way to resolve this?

Hi,
If your file contains special characters, my best suggestion is that instead of using the csv filter you extract the fields with a grok filter, to avoid that kind of issue.

Hi Ganesh - how would I do that with a grok filter?

The following will process the first 5 and the last 7 fields, leaving you with 2 fields in [message]. I'm not sure how to parse those 2.

    if [message] =~ /^"GIT_ORG"/ { drop {} }

    #Strip and process 5 well formatted fields at the beginning
    grok { match => { "message" => '^(?<first5>("[^"]*"|),("[^"]*"|),("[^"]*"|),("[^"]*"|),("[^"]*"|)),' } }
    mutate { gsub => [ "message", '^("[^"]*"|),("[^"]*"|),("[^"]*"|),("[^"]*"|),("[^"]*"|),', "" ] }
    csv {
        source => "first5"
        columns => [ "GIT_ORG", "GIT_REPOS", "COMMIT_SHA1", "COMMIT_AUTHOR", "COMMIT_DATE"]
        remove_field => [ "first5" ]
    }

    #Strip and process 7 well formatted fields at the end
    grok { match => { "message" => ',(?<last7>("[^"]*"|),("[^"]*"|),("[^"]*"|),("[^"]*"|),("[^"]*"|),("[^"]*"|),("[^"]*"|))$' } }
    mutate { gsub => [ "message",           ',("[^"]*"|),("[^"]*"|),("[^"]*"|),("[^"]*"|),("[^"]*"|),("[^"]*"|),("[^"]*"|)$', "" ] }
    csv {
        source => "last7"
        columns => [ "JIRA_ISSUE_TYPE", "JIRA_ISSUE_CATEGORY", "JIRA_PRIORITY", "JIRA_COMPONENTS", "JIRA_DEFECT_TYPE", "JIRA_TESTCASE_ID", "JIRA_FIND_BY_AUTO" ]
        remove_field => [ "last7" ]
    }

It may require some adhocery, but for those 3 example lines this would work:

grok { match => { "message" => '^"%{DATA:COMMIT_COMMENT}","%{DATA:JIRA_KEY}"$' } }
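
Putting those pieces together, the filter section would run in roughly this order (a sketch; the numbered comments stand in for the grok/mutate/csv blocks shown above):

    filter {
      # skip the header row
      if [message] =~ /^"GIT_ORG"/ { drop {} }

      # 1. grok + mutate + csv for the first 5 well-formatted fields
      # 2. grok + mutate + csv for the last 7 well-formatted fields

      # 3. what remains in [message] is the two free-text fields
      grok { match => { "message" => '^"%{DATA:COMMIT_COMMENT}","%{DATA:JIRA_KEY}"$' } }
    }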

Try this pattern; it will work, but you need a value in every column after each comma.

"%{WORD:data1}","?(?[a-zA-Z0-9-]+)","%{WORD:data3}","%{WORD:data4}","?(?[a-zA-Z0-9-]+)","?(?[A-Za-z0-9:\" .-]+)","?(?[a-zA-Z0-9-]+)","?(?[a-zA-Z0-9-]+)","?(?[a-zA-Z0-9 -]+)","?(?[a-zA-Z0-9-]+)","?(?[a-zA-Z0-9 .|-]+)","?(?[a-zA-Z0-9-]+)","?(?[a-zA-Z0-9 -]+)","?(?[a-zA-Z0-9-]+)


Thank you Badger, helpful suggestion.

Thank you Ganesh, that gives me another approach.
