How can I accurately import a CSV file where lines contain fields with multiline content? The separator is the default comma, but multiline content is surrounded by double quotes. For example:
The input file is actually a Jira issues export (jira.issueviews:searchrequest-csv-all-fields).
What's the optimal way to load Jira issues into ELK? If it turns out that Jira can be imported without the intermediate CSV step, I guess a new topic should be created.
That works for most of the line breaks, except when a line starts with a closing double quote followed by a comma (",). The pseudo example above covers these cases.
So, how do I write the proper regexp: every new document starts with a double quote, and that double quote is not followed by a comma?
but I am getting all sorts of errors, like :exception=>#<CSV::MalformedCSVError: Missing or stray quote in line 1>
Finally, after pre-processing (e.g. removing empty lines), the file is imported. However, how can this be automated? Multiline (rich content) quoted fields are a common use case. Is there a configuration and import example? It might be worth including one in the multiline codec plugin documentation.
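One way to automate the pre-processing is to skip the multiline codec entirely: let a real CSV parser do the record splitting and hand Logstash newline-delimited JSON instead. A minimal sketch in Ruby (the function name and field names are illustrative, and it assumes the export is small enough to read in one go):

```ruby
require "csv"
require "json"

# Parse the whole export with Ruby's CSV library, which handles
# quoted multiline fields correctly, and emit one JSON object per
# issue. Logstash can then read the result with the json codec
# and no multiline configuration at all.
def csv_to_ndjson(csv_text)
  CSV.parse(csv_text, headers: true).map { |row| row.to_h.to_json }
end

sample = <<~CSV
  Summary,Description
  "Some bug","line1
  line2"
CSV

lines = csv_to_ndjson(sample)
puts lines
# => {"Summary":"Some bug","Description":"line1\nline2"}
```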
Hm, I think your regular expression should have two capture groups: one for the blank line (\n, assuming Unix line endings) and one for ",. Maybe something like this?
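A sketch of how that could look with the multiline codec, using a negative lookahead so that a line starting with `"` only begins a new event when the quote is not followed by a comma (the path is a placeholder, and this is only an approximation, not a complete fix):

```
input {
  file {
    path => "/path/to/jira-export.csv"   # placeholder
    codec => multiline {
      # A new record starts with a double quote that is NOT
      # immediately followed by a comma; any other line is
      # joined to the previous event.
      pattern => '^"(?!,)'
      negate => true
      what => "previous"
    }
  }
}
```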
I do not believe a multiline codec maintains enough state to solve this in the general case. Consider:
field1,field2
"field1, ya know",field2
"line1 of field1
line2
",field2
A full solution would need a codec that consumes a character at a time, not a line at a time. There are all sorts of corner cases where this mismatch breaks things. It has been discussed a lot (you can tell that from this thread between Colin and Guy).
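For what it's worth, a parser that does consume the input character by character handles exactly this case. Ruby's standard CSV library, for example, parses the sample above correctly (sketch only):

```ruby
require "csv"

# The sample from above: a plain row, a quoted field containing a
# comma, and a quoted field spanning three physical lines.
data = <<~CSV
  field1,field2
  "field1, ya know",field2
  "line1 of field1
  line2
  ",field2
CSV

rows = CSV.parse(data)
rows.each { |r| p r }
# The last record's first field keeps its embedded newlines:
# ["line1 of field1\nline2\n", "field2"]
```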
Another variant is a line-oriented input consuming a line that ends in \n and feeding it to a codec that reads character pairs in UTF-16. Because the input does not consume the second byte of the UTF-16 character that begins with \n, the endianness of the rest of the file is flipped and the text turns into gibberish.
It will (IMHO) never get fixed. Logstash functionality is being pulled back into Beats processors, or pushed forward into Elasticsearch ingest pipelines. I very much doubt Elastic has an appetite to re-architect a Logstash input design that almost always works.