autodetect_column_names is not working as expected in the csv filter plugin


(R01K) #1

Hi Team,

I'm trying to parse CSV files with a few rows and auto-detect the column names, but it's not working as expected.
For example, the input file temp.csv:

name,compan,emp,abc,address
ro han,,,be,myadd
a b c,234,3 city
,ABC CO. LTD.,mycomp,myemp,myabcc, city WEST

test.conf

input {
  file {
    path => "/tmp/temp.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  csv {
    skip_header => true
    autodetect_column_names => true
    autogenerate_column_names => true
  }
}
output {
  stdout { codec => rubydebug }
}

output

{
"message" => "name,compan,emp,abc,address",
"path" => "/tmp/temp.csv",
"column2" => "compan",
"myadd" => "address",
"column3" => "emp",
"ro han" => "name",
"@timestamp" => 2019-04-16T17:10:14.888Z,
"be" => "abc",
"@version" => "1"
}
{
"ro han" => "a b c",
"@timestamp" => 2019-04-16T17:10:14.925Z,
"message" => "a b c,234,3 city",
"@version" => "1",
"path" => "/tmp/temp.csv",
"column2" => "234",
"column3" => "3 city"
}
{
"message" => ",ABC CO. LTD.,mycomp,myemp,myabcc, city WEST",
"path" => "/tmp/temp.csv",
"column2" => "ABC CO. LTD.",
"myadd" => "myabcc",
"column3" => "mycomp",
"column6" => " city WEST",
"ro han" => nil,
"@timestamp" => 2019-04-16T17:10:14.926Z,
"be" => "myemp",
"@version" => "1"
}

Kindly help me auto-detect the column names.

Thanks,
Rohan


#2

Have you set "--pipeline.workers 1"? You cannot use multiple worker threads with autodetect_column_names because it creates race conditions. Specifically, a second worker thread could parse the second row and use it to set the column names before the first worker thread does so, which appears to be exactly what happened here.
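
To make the race concrete, here is a rough Python sketch that replays the four lines of temp.csv in two different arrival orders. It is a simplified model and not the plugin's actual code: the function name parse_in_arrival_order is made up for illustration, and it assumes that the first row to reach the filter becomes the header (and emits no event) and that blank header cells fall back to "columnN".

```python
import csv


def parse_in_arrival_order(lines):
    """Simplified model (an assumption, not the plugin source) of the csv
    filter with autodetect_column_names and autogenerate_column_names."""
    columns = None
    events = []
    for line in lines:
        values = next(csv.reader([line]))
        if columns is None:
            # The first row to arrive becomes the header and emits no event.
            columns = values
            continue
        event = {}
        for i, value in enumerate(values):
            # Blank or missing header cells fall back to "columnN" (1-based).
            has_name = i < len(columns) and columns[i]
            name = columns[i] if has_name else f"column{i + 1}"
            # Empty fields are shown as None, like nil in the rubydebug output.
            event[name] = value or None
        events.append(event)
    return events


rows = [
    "name,compan,emp,abc,address",
    "ro han,,,be,myadd",
    "a b c,234,3 city",
    ",ABC CO. LTD.,mycomp,myemp,myabcc, city WEST",
]

# One worker: rows arrive in file order, so the real header row is used.
in_order = parse_in_arrival_order(rows)

# Several workers (one hypothetical interleaving): the second line of the
# file reaches the filter first and is taken as the header.
raced = parse_in_arrival_order([rows[1], rows[0], rows[2], rows[3]])

print(in_order[0])
print(raced[0])
```

In this model, raced[0] has the same field names and values as the first event in the output you posted ("ro han" => "name", "column2" => "compan", and so on), which is consistent with the second line having been picked up as the header. With the rows in file order, the keys are name, compan, emp, abc and address, as intended.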

Issues 65 and 72 on GitHub are related.


(R01K) #3

Thanks @Badger!

No, I haven't set the number of workers to 1; it was left at the default.