Input codec=>multiline with filter csv not behaving as expected

I want to ingest a poorly formatted CSV file into Elasticsearch. The file is the CSV indexer output from Nutch 1.15, and it has rows like the following:

http://blackseamap.com/bshome/feed/,Comments on: Black Sea Home,"Comments on: Black Sea Home
Comments on: Black Sea Home
Dive into the mysterious waters of the Black Sea and discover what stories can be revealed.
"
http://blackseamap.com/careers-in-action/,Black Sea M.A.P – Maritime Archaeology Project | Careers in Action,"Black Sea M.A.P – Maritime Archaeology Project | Careers in Action
En
English
Bulgarian
The Mission
The Team
Education
Education Home

I have built a Logstash config as follows. My hope is that the multiline codec concatenates the continuation lines and the csv filter then parses out the url, title, and content. But it never identifies the different fields separated by the commas; I just end up with the message and no id, title, or content. What have I missed?

input {
	file {
		path => "/home/monkstown/Nutch/nutch1.15/csvindexwriter/nutch.csv"
		start_position => "beginning"
		sincedb_path => "/dev/null"
		codec => multiline {
			pattern => "^http"
			negate => true
			what => "previous"
		}
	}
}
filter {
	csv {
		separator => ","
		columns => ["id","title","content"]
	}
}
output {
	elasticsearch {
		hosts => "localhost"
		index => "oatest1"
		document_type => "oa_basic"
	}
	stdout {}
}

Thank-you

Your configuration seems to parse the first entry. The second does not have a closing double quote, so a csv filter will not parse it.
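To see the effect outside Logstash, Python's csv module (which follows similar RFC 4180-style quoting rules) behaves the same way: an unclosed quote makes the parser keep consuming input, so following lines are swallowed into one field. This is only an illustration with made-up sample rows, not part of the Logstash pipeline:

```python
import csv
from io import StringIO

# A well-formed row: the quoted third field parses cleanly.
good = 'http://example.com/,Title,"Some content"\n'
row = next(csv.reader(StringIO(good)))
# row == ['http://example.com/', 'Title', 'Some content']

# A row whose quoted third field never closes: the parser keeps
# consuming input, so the next line ends up inside that field.
bad = 'http://example.com/,Title,"Unclosed content\nnext line\n'
row2 = next(csv.reader(StringIO(bad)))
# row2 still has three fields, but the third contains "next line".
```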

Thanks Badger,

You answered my first question: is my config file "correct"? It would seem yes, until I break it.

I only included the top of the file I'm parsing as an example, so the " then becomes the issue. Yes, there was no closing quote in my example... but there are double quotes throughout the content, so this is probably what is messing things up.

What is the best practice for dealing with double quote (") characters embedded in the content? I'm thinking a tilde (~) would be the best replacement. Or is this something I'm going to have to alter load-to-load depending on the special characters in the content?

Thank-you.

Provided there are no commas in the third field, you could mutate+gsub all the double quotes to something else. If there are commas, I would give up on csv and use dissect:
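For reference, the mutate+gsub option would be a one-liner along these lines (a sketch; it rewrites every double quote in the message to a tilde, and would need to run before the csv filter):

mutate { gsub => [ "message", "\"", "~" ] }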

dissect { mapping => { "message" => "%{field1},%{field2},%{field3}" } }

In fact I might go with that in preference to csv in the first place.
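Putting that together, a filter section using dissect in place of csv might look like this (a sketch; the field names are carried over from the columns list in the original config):

filter {
	dissect {
		mapping => { "message" => "%{id},%{title},%{content}" }
	}
}

Because %{content} is the last key, it captures the remainder of the (multiline-assembled) message, commas, quotes and all.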


Badger.
Fantastic! Thank-you!
Dissect worked well.
So much to learn so little time.
Peter

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.