CSV: autodetect_column_names vs. autogenerate_column_names

Hi,

  1. What are the exact functional differences between autodetect_column_names and autogenerate_column_names in the csv filter?

  2. Is it good or bad to use both (set to true)?

  3. Is there a recommended order for using both simultaneously?

Neither seems to be working as expected in my case (just a standard CSV filled with basic info).

Since we're on this:
4) In my case, Filebeat sends the csv data to Logstash. What are the detailed operational differences if I start Filebeat before Logstash and vice versa?

Thanks.

If autogenerate_column_names is enabled, it will create its own names for columns where no name is supplied. For example, if we have a CSV with two fields and we parse it with

csv {
     autogenerate_column_names => false
     columns => [ "Foo" ]
}

then the events will only have data from the first column, which will be in a field called Foo. If we parse it with

csv {
     autogenerate_column_names => true
     columns => [ "Foo" ]
}

then the events will have two fields of data: one called Foo and one called column2.

For autodetect_column_names, suppose we have a file that contains

foo,BAR,baz
1,2,3

If we parse that using

csv {
     autodetect_column_names => true
}

We will get a single event that has these fields in it

       "foo" => "1",
       "baz" => "3",
       "BAR" => "2"

It detects the column names by consuming the header line, so the header itself does not produce an event.

Setting both to true might be useful if you were consuming a file that had an incomplete header line, like this

foo,BAR
1,2,3
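
As a rough sketch of that case, assuming the file above is fed through the csv filter with both options enabled and no columns option set:

```
filter {
  csv {
    # Take the column names from the first line the filter sees (foo, BAR)
    autodetect_column_names => true
    # Generate a columnN name for any value that has no detected name
    autogenerate_column_names => true
  }
}
```

The header line supplies foo and BAR, and the third value has no name. Following the column2 naming shown earlier, I would expect it to land in column3:

       "foo" => "1",
       "BAR" => "2",
   "column3" => "3"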

This is incredibly helpful, thank you so much, Badger. This could also explain why I've been getting the same data parsed twice. I had strong suspicions about these two settings but wasn't sure of the exact difference between them.
Thanks again.
