Hi Team,
I have a huge CSV file exported from GitHub that contains engineers' commit information. Two of the columns (COMMIT_COMMENT and JIRA_KEY) may contain special characters, but unfortunately I cannot know in advance which special characters, or how many, the file will contain, because the content is typed manually by engineers in the JIRA system and cannot be controlled. In theory they can enter any special characters, or even HTML tags.
1. A sample of the CSV file that contains special characters
> "GIT_ORG","GIT_REPOS","COMMIT_SHA1","COMMIT_AUTHOR","COMMIT_DATE","COMMIT_COMMENT","JIRA_KEY","JIRA_ISSUE_TYPE","JIRA_ISSUE_CATEGORY","JIRA_PRIORITY","JIRA_COMPONENTS","JIRA_DEFECT_TYPE","JIRA_TESTCASE_ID","JIRA_FIND_BY_AUTO"
> "bizx","au-recruiting","d88e03469576f30079c366850fc952864815fb65","Kaderjan Ilghar","2018-01-24","fixing SML-710\, SML-714\,SML-708\,SML-709","","","","",,,,
> "bizx","au-V4","f54abe7a0516407a8f6e2334ce8b1a1c229dd6aa","wwang","2016-09-19","TFT- DTCE new common component change ( Change FeatureEnum type)","","","","",,,,
> "bizx","au-V4","da5ab4deecd598e514bc3848d1dd0204643ccc0a","I848428","2017-03-01","PLT-56858: Remove the \"field-permissions\" level message key for field-permissions.","PLT-56858","Defect","Internal Issue","3-Medium","|Z. DO NOT USE Permission|","New","","No"
My Logstash configuration:
input {
  file {
    path => "C:/elkstack/elasticsearch-6.5.1/logs/git_commits.csv"
    start_position => "beginning"
    sincedb_path => "NUL"
  }
}

filter {
  csv {
    columns => [
      "GIT_ORG",
      "GIT_REPOS",
      "COMMIT_SHA1",
      "COMMIT_AUTHOR",
      "COMMIT_DATE",
      "COMMIT_COMMENT",
      "JIRA_KEY",
      "JIRA_ISSUE_TYPE",
      "JIRA_ISSUE_CATEGORY",
      "JIRA_PRIORITY",
      "JIRA_COMPONENTS",
      "JIRA_DEFECT_TYPE",
      "JIRA_TESTCASE_ID",
      "JIRA_FIND_BY_AUTO"
    ]
    separator => ","
  }

  translate {
    destination     => "[@metadata][lookup]"
    dictionary_path => "C:/elkstack/elasticsearch-6.5.1/logs/LookUp.csv"
    field           => "COMMIT_SHA1"
  }

  dissect {
    mapping => { "[@metadata][lookup]" => "%{FILE_CHANGED};%{FILE_TYPE};%{FILE_METHOD}" }
  }
}

output {
  elasticsearch {
    action => "index"
    hosts  => "localhost:9200"
    index  => "logstash-git"
  }
  stdout { codec => rubydebug }
}
Logstash trace log (the row containing the backslash-escaped quotes is tagged with _csvparsefailure):
> {
>           "tags" => [
>         [0] "_csvparsefailure"
>     ],
>           "host" => "PVGN50859047A",
>       "@version" => "1",
>     "@timestamp" => 2019-03-05T09:16:47.430Z,
>        "message" => "\"bizx\",\"au-V4\",\"da5ab4deecd598e514bc3848d1dd0204643ccc0a\",\"I848428\",\"2017-03-01\",\"PLT-56858: Remove the \\\"field-permissions\\\" level message key for field-permissions.\",\"PLT-56858\",\"Defect\",\"Internal Issue\",\"3-Medium\",\"|Z. DO NOT USE Permission|\",\"New\",\"\",\"No\"\r",
>           "path" => "C:/elkstack/elasticsearch-6.5.1/logs/less.csv"
> }
Could you please advise how I should get rid of this issue? The precondition is that I cannot know in advance which special characters will appear in the CSV. Is there a general way to resolve this?
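The only general idea I have come up with so far is to normalize the escaping before the csv filter runs, for example with mutate/gsub. This is just a sketch of what I mean (not part of my current config), and the exact pattern may need adjusting depending on how string escaping is configured:

filter {
  # Sketch: replace every backslash-escaped quote (\") in the raw line with
  # an RFC 4180 doubled quote ("") so the csv filter can parse it.
  # '\\"' is a regex matching a literal backslash followed by a double quote;
  # '""' is the replacement. This assumes config.support_escapes is left at
  # its default (false), so the backslashes in the single-quoted pattern are
  # taken literally.
  mutate {
    gsub => [ "message", '\\"', '""' ]
  }

  # ...the existing csv / translate / dissect filters would follow here...
}

But since the engineers can type anything into those fields, I am not sure whether this kind of targeted replacement is reliable enough, or whether there is a more robust, general solution.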