Problem Ingesting CSV with quoted description field


(Emmanuel Galeana) #1

Hello everyone I'm having problems with a csv file with information like this

F18AP00207,,,SAI EXEMPT,83641.00,77059.00,160700,0.00,0.00,0.00,0.00,2018-05-29,2018-05-29,2021-06-30,014,DEPARTMENT OF THE INTERIOR (DOI),1448,U.S. FISH AND WILDLIFE SERVICE,,,,,,,,,534761056,ENVIRONMENT SOCIETY OF OMAN,OMN,OMAN,AL ELEM ST- AL KHUWAIR,,,,,,,,,,,,FOREIGN CITY,,,OMN,OMAN,00FORGN,,,,,,,OMAN,15.645,MARINE TURTLE CONSERVATION FUND,04,"THE MARINE TURTLE CONSERVATION FUND IS SOLICITING PROPOSALS FOR THE CONSERVATION OF MARINE TURTLES (CHELONIA MYDAS, CARETTA CARETTA, DERMOCHELYS CORIACEA, ERETMOCHELYS IMBRICATA, LEPIDOCHELYS OLIVACEA, LEPIDOCHELYS KEMPII) THROUGHOUT THEIR RANGE OUTSIDE OF THE UNITED STATES AND ITS TERRITORIES. THE U.S. GOVERNMENT ENACTED THE MARINE TURTLE CONSERVATION ACT (MTCA) OF 2004 IN RESPONSE TO THE DECLINE OF MANY MARINE TURTLE POPULATIONS WORLDWIDE AND THE SERIOUS THREATS TO THEIR LONG-TERM SURVIVAL. THE PRIMARY PURPOSE OF THE ACT IS TO PROVIDE FINANCIAL SUPPORT FOR PROJECTS THAT CONSERVE NESTING POPULATIONS AND HABITAT AND ADDRESS OTHER THREATS TO THE SURVIVAL OF MARINE TURTLES IN FOREIGN COUNTRIES.

THE GOAL OF THIS PROGRAM IS TO REDUCE THREATS TO MARINE TURTLES IN THEIR NATURAL HABITAT. PROPOSALS SHOULD IDENTIFY SPECIFIC CONSERVATION ACTIONS THAT HAVE A HIGH LIKELIHOOD OF CREATING LASTING BENEFITS. PROJECT ACTIVITIES THAT EMPHASIZE DATA COLLECTION AND STATUS ASSESSMENT SHOULD DESCRIBE A DIRECT LINK TO MANAGEMENT ACTION, AND EXPLAIN HOW LACK OF INFORMATION HAS BEEN A KEY LIMITING FACTOR FOR MANAGEMENT ACTION IN THE PAST. PROPOSALS THAT DO NOT IDENTIFY HOW ACTIONS WILL REDUCE THREATS OR THAT DO NOT DEMONSTRATE A STRONG LINK BETWEEN DATA COLLECTION AND MANAGEMENT ACTION WILL BE DISQUALIFIED.

PROPOSED PROJECT WORK SHOULD OCCUR WITHIN THE SPECIES RANGE, OUTSIDE OF THE UNITED STATES. IF WORK IS TO BE CONDUCTED OUTSIDE OF THE SPECIES RANGE, THE PROPOSAL SHOULD SHOW A CLEAR RELEVANCE TO ITS CONSERVATION.",NON,W,A,2,,2018-06-11 18:47:34.99687

The problem is that a field contains a qouted description, and that could include jump lines, and commas inside.
I tried to use
gsub=> ["message", "[\r\n]", ""]
but i didn't work.
Someone knows a possible solution to this issue?


#2

Is logstash seeing that as a single event? If it is, then a simple

csv {}

would parse it without an issue. However, if it is multiple lines then that's a different problem.


(Emmanuel Galeana) #3

Thank you for your reply, yes its a single event, my logstash filter looks like this

input {
file {
path => "doc.csv"
sincedb_path => "/dev/null"
start_position => "beginning"
codec => plain {
charset => "ISO-8859-1"
}
}

}

filter {

csv{
separator => ","
columns => [ modification_number,award_id_uri,etc... ]
}

mutate {
remove_field => ["path"]
}}
output {
stdout { codec => 'dots' }
elasticsearch {
index => 'example'
hosts => ["http://localhost:9200"]
}
}


#4

What exactly is the problem? Are there error messages? If not, then that does output { stdout { codec => rubydebug } } produce and what would you like to change in it?


(Emmanuel Galeana) #5

Yes the errors are

[2018-06-27T21:17:02,002][WARN ][logstash.filters.csv ] Error parsing csv {:field=>"message", :source=>"CRP - COST SHARES",NON,P,A,3,,2018-05-24 17:43:08.508849\r", :exception=>#<CSV::MalformedCSVError: Illegal quoting in line 1.>}
[2018-06-27T21:17:02,033][WARN ][logstash.filters.csv ] Error parsing csv {:field=>"message", :source=>"12FA00PY56960488,,,SAI EXEMPT,974.00,0.00,974.00,,,,,2018-04-17,,,012,DEPARTMENT OF AGRICULTURE (USDA),12D2,FARM SERVICE AGENCY,,,012,DEPARTMENT OF AGRICULTURE (USDA),12D2,FARM SERVICE AGENCY,,,,REDACTED DUE TO PII,USA,UNITED STATES,,,,67236,SALEM,121,MARION,IL,ILLINOIS,62881,,15,,,,USA,UNITED STATES,IL67236,SALEM,121,MARION,ILLINOIS,62881,15,,10.069,CONSERVATION RESERVE PROGRAM,06,"AUTO \r", :exception=>#<CSV::MalformedCSVError: Unclosed quoted field on line 1.>}

When this line is being processed:

12FA00PY56960488,,,SAI EXEMPT,974.00,0.00,974.00,,,,,2018-04-17,,,012,DEPARTMENT OF AGRICULTURE (USDA),12D2,FARM SERVICE AGENCY,,,012,DEPARTMENT OF AGRICULTURE (USDA),12D2,FARM SERVICE AGENCY,,,,REDACTED DUE TO PII,USA,UNITED STATES,,,,67236,SALEM,121,MARION,IL,ILLINOIS,62881,,15,,,,USA,UNITED STATES,IL67236,SALEM,121,MARION,ILLINOIS,62881,15,,10.069,CONSERVATION RESERVE PROGRAM,06,"AUTO
CRP - COST SHARES",NON,P,A,3,,2018-05-24 17:43:08.508849


#6

A csv filter has no problem with newlines in quoted fields. For example:

output { stdout { codec => rubydebug } }
input { generator { count => 1 message => 'a,"b
c",d' } }
filter {
    csv { }
}

Gets you

   "column2" => "b\nc",
   "column3" => "d",
   "column1" => "a"

It appears to me that the problem is that your input is splitting the two lines into two events. What sort of input are you using?


(Emmanuel Galeana) #7

I'm using a csv downloaded from Usaspendings
exactly this csv : https://s3-us-gov-west-1.amazonaws.com/usaspending-monthly-downloads/2018_all_Assistance_Full_20180618.zip

the problem is in the field "award description" that contains a quoted description.


#8

What does your logstash input configuration look like?

BTW, I think you replied to your own message. If you reply to my message instead then I get notified and I will come back to see what you wrote :slight_smile:


(Emmanuel Galeana) #9

lol sorry

this is my input config

input {
file {
path => "C:\Users\Latitude E5470\Documents\ELK\Documentos\Archivos\Financial\2018\assistance_prime_transactions_4.csv"
sincedb_path => "/dev/null"
start_position => "beginning"
codec => plain {
charset => "ISO-8859-1"
}
}

}


#10

On Linux you could use a multiline codec on the file input to join together lines that do not end in with Ctrl/M.

codec => multiline { pattern => "^M" negate => true what => "next" auto_flush_interval => 2 }

You would also need to remove the Ctrl/M

mutate { gsub => [ "message", "^M", "" ] }

I do not know how to do it on Windows.


(Emmanuel Galeana) #11

Ok I will try it in linux and search a way to do it in windows.
Thanks for your help!


(system) #12

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.