How to import CSV files where a couple of fields have multiline content

Hello,

How do I accurately import a CSV file whose lines contain fields with multiline content? The default separator is a comma, but multiline content is surrounded by double quotes. For example:

Summary,Issue key,Issue id,Parent id,Issue Type,Status,Project key
Content1,Content2,"Multiline content 3 line1
Multiline content line2

Multiline content line3
",Content4,"Multiline content 5 line1
Multiline content 5 line2
Multiline content 5 line3
",Content6,Content7

The input file is actually a Jira issues export (jira.issueviews:searchrequest-csv-all-fields).
What's the optimal way to load Jira issues into ELK? If it turns out that Jira can be imported without an intermediate CSV step, I guess a new topic should be created.
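
To make the target concrete, the mapping I am after is roughly the following csv filter (column names taken from the export header above; this is only a sketch and assumes each CSV record already arrives as a single event, which is exactly the part I am missing):

    filter {
      csv {
        source => "message"
        separator => ","
        columns => [ "Summary", "Issue key", "Issue id", "Parent id",
                     "Issue Type", "Status", "Project key" ]
      }
    }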

Regards

There is a Multiline codec plugin,
but what pattern/configuration should be used to distinguish between

  1. simple fields (just comma-separated)
  2. multiline fields (double-quoted and comma-separated)?

Regards

All fields are now quoted for the sake of consistency.

        codec => multiline {
            # with negate => true and what => "previous", any line that does
            # NOT start with a double quote is appended to the previous event
            pattern => "^\""
            negate => true
            what => "previous"
        }

works for most of the line breaks, except when a line starts with a closing double quote followed by a comma (",). The pseudo example above covers these cases.
So, how do I write the proper regexp: every new document starts with a double quote, and that double quote is not followed by a comma?
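
In other words, I suspect a negative lookahead is what is needed here (untested sketch; as far as I know the multiline pattern is a Ruby/Oniguruma regexp, which supports lookaheads):

        codec => multiline {
            # start a new event only on a line that begins with a double quote
            # NOT immediately followed by a comma
            pattern => "^\"(?!,)"
            negate => true
            what => "previous"
        }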

Regards

Have also tried

        codec => multiline {
            # a new event starts on a line beginning with a double quote that is
            # followed by anything other than a comma
            pattern => "^\"[^,]"
            negate => true
            what => "previous"
        }

but I am getting all sorts of trouble, like :exception=>#<CSV::MalformedCSVError: Missing or stray quote in line 1>

Finally, after pre-processing (such as removing empty lines), the file is imported. However, how can this be automated?
Multiline (rich content), quoted fields are a common use case. Is there a configuration and import example? It might be worth including one in the Multiline codec plugin documentation.
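
For the record, the closest I have to an automated setup is a full pipeline along these lines (untested sketch: the path, index name, and the blank-line gsub are my own placeholders and assumptions; the gsub mimics the manual pre-processing, so intentional blank lines inside quoted fields are lost):

    input {
      file {
        # placeholder path to the Jira CSV export
        path => "/path/to/jira-export.csv"
        start_position => "beginning"
        sincedb_path => "/dev/null"
        codec => multiline {
          # all fields are quoted, so a record starts on a line beginning
          # with a double quote; everything else is appended to it
          pattern => "^\""
          negate => true
          what => "previous"
          # flush the last buffered record even if no further line arrives
          auto_flush_interval => 5
        }
      }
    }

    filter {
      # automate the manual pre-processing: delete any newline that is
      # immediately followed by another newline, i.e. remove empty lines
      # (this also strips intentional blank lines inside quoted fields)
      mutate {
        gsub => [ "message", "\n\s*(?=\n)", "" ]
      }

      csv {
        source => "message"
        separator => ","
        quote_char => '"'
        columns => [ "Summary", "Issue key", "Issue id", "Parent id",
                     "Issue Type", "Status", "Project key" ]
      }

      # drop the header record
      if [Summary] == "Summary" {
        drop { }
      }
    }

    output {
      elasticsearch {
        hosts => [ "localhost:9200" ]
        # placeholder index name
        index => "jira-issues"
      }
    }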

Regards

Hm, I think your regular expression needs an alternation to capture this: one branch for the blank line (\n, assuming Unix line endings) and one for ",. Maybe something like this?

codec => multiline {        
    pattern => "^(\",|\n)"
    negate => true
    what => "previous"
}

I do not believe a multiline codec maintains enough state to solve this in the general case. Consider

field1,field2
"field1, ya know",field2
"line1 of field1
line2
",field2

A full solution would need a codec that consumes a character at a time, not a line at a time. There are all sorts of corner cases where this mismatch breaks things; for example, a quoted field whose continuation line happens to begin with a double quote will be misread by any line-based pattern as the start of a new record, because the codec has no idea it is still inside an open quote. It has been discussed a lot (you can tell that from this thread between Colin and Guy).

Another variant is a line-oriented input consuming a line that ends in \n and feeding it to a codec that reads character pairs as UTF-16. Because the input does not consume the second byte of the UTF-16 character that starts with \n, the endianness of the rest of the file is flipped and the text turns into gibberish.

It will (IMHO) never get fixed. Logstash functionality is getting pulled back into Beats processors or pushed forward into Elasticsearch ingest pipelines. I very much doubt Elastic has an appetite to re-architect the Logstash input design, which almost always works.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.