Input codec=>multiline with filter csv not behaving as expected

I want to ingest a poorly formatted CSV file into Elasticsearch. The file is the CSV indexer output from Nutch 1.15, and it has rows like the following:

http://blackseamap.com/bshome/feed/,Comments on: Black Sea Home,"Comments on: Black Sea Home
Comments on: Black Sea Home
Dive into the mysterious waters of the Black Sea and discover what stories can be revealed.
"
http://blackseamap.com/careers-in-action/,Black Sea M.A.P – Maritime Archaeology Project | Careers in Action,"Black Sea M.A.P – Maritime Archaeology Project | Careers in Action
En
English
Bulgarian
The Mission
The Team
Education
Education Home

I have built a Logstash config as follows. My hope is that the multiline codec concatenates the continuation lines and the csv filter then parses out the url, title, and content. But it never identifies the different fields separated by the commas; I just end up with the message and no id, title, or content. What have I missed?

input {
	file {
		path => "/home/monkstown/Nutch/nutch1.15/csvindexwriter/nutch.csv"
		start_position => "beginning"
		sincedb_path => "/dev/null"
		codec => multiline {
			pattern => "^http"
			negate => true
			what => "previous"
		}
	}
}
filter {
	csv {
		separator => ","
		columns => ["id","title","content"]
	}
}
output {
	elasticsearch {
		hosts => "localhost"
		index => "oatest1"
		document_type => "oa_basic"
	}
	stdout {}
}

Thank-you

Your configuration seems to parse the first entry. The second does not have a closing double quote, so a csv filter will not parse it.
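To see the effect outside Logstash, Python's csv module (which follows similar RFC 4180-style quoting rules) behaves the same way: an unclosed quote makes the parser keep consuming input, so following lines are swallowed into one field. This is only an illustration with made-up sample rows, not part of the Logstash pipeline:

```python
import csv
from io import StringIO

# A well-formed row: the quoted third field parses cleanly.
good = 'http://example.com/,Title,"Some content"\n'
row = next(csv.reader(StringIO(good)))
# row == ['http://example.com/', 'Title', 'Some content']

# A row whose quoted third field never closes: the parser keeps
# consuming input, so the next line ends up inside that field.
bad = 'http://example.com/,Title,"Unclosed content\nnext line\n'
row2 = next(csv.reader(StringIO(bad)))
# row2 still has three fields, but the third contains "next line".
```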

Thanks Badger,

You answered my first question: is my config file "correct"? It would seem yes, until I break it.

I only included the top of the file I'm parsing as an example, so the " then becomes the issue. Yes, there was no closing quote in my example... but there are double quotes throughout the content, so this is probably what is messing things up.

What is the best practice for dealing with double quote (") characters embedded in the content? I'm thinking a tilde (~) would be the best replacement. Or is this something I'm going to have to alter load-to-load depending on the special characters in the content?

Thank-you.

Provided there are no commas in the third field, you could mutate+gsub all the double quotes to something else. If there are commas, I would give up on csv and use dissect:
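For reference, the mutate+gsub option would be a one-liner along these lines (a sketch; it rewrites every double quote in the message to a tilde, and would need to run before the csv filter):

mutate { gsub => [ "message", "\"", "~" ] }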

dissect { mapping => { "message" => "%{field1},%{field2},%{field3}" } }

In fact I might go with that in preference to csv in the first place.
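Putting that together, a filter section using dissect in place of csv might look like this (a sketch; the field names are carried over from the columns list in the original config):

filter {
	dissect {
		mapping => { "message" => "%{id},%{title},%{content}" }
	}
}

Because %{content} is the last key, it captures the remainder of the (multiline-assembled) message, commas, quotes and all.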


Badger.
Fantastic! Thank-you!
Dissect worked well.
So much to learn so little time.
Peter

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.