CSV: autodetect_column_names vs. autogenerate_column_names


(Aralex) #1

Hi,

  1. What are the exact functional differences between autodetect_column_names and autogenerate_column_names in the csv filter?

  2. Is it good or bad to use both (set to true)?

  3. Is there a recommended order when using both simultaneously?

Neither seems to be working as expected in my case (just a standard CSV file filled with basic info).

Since we're on this:
4) In my case, Filebeat sends the csv data to Logstash. What are the detailed operational differences if I start Filebeat before Logstash and vice versa?

Thanks.


(Badger) #2

If autogenerate_column_names is enabled, the filter creates its own names for columns where no name is supplied. For example, if we have a CSV with two fields and we parse it with

csv {
     autogenerate_column_names => false
     columns => [ "Foo" ]
}

then the events will only have data from the first column, which will be in a field called Foo. If we parse it with

csv {
     autogenerate_column_names => true
     columns => [ "Foo" ]
}

then the events will have two fields of data: one called Foo and one called column2.
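
To see how the generated names are numbered, here is a minimal sketch of a filter for a three-column line such as a,b,c (the sample data is made up for illustration). As far as I know, generated names follow the column's position, so I would expect fields named Foo, column2 and column3, but it is worth checking against your own version of the plugin.

```
filter {
  csv {
    # Only the first column is named explicitly.
    columns => [ "Foo" ]
    # Columns without a name get a generated one instead of being dropped.
    autogenerate_column_names => true
  }
}
```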

For autodetect ... suppose we have a file that contains

foo,BAR,baz
1,2,3

If we parse that using

csv {
     autodetect_column_names => true
}

We will get a single event that has these fields in it

       "foo" => "1",
       "baz" => "3",
       "BAR" => "2"

It detects the column name by consuming the header line.
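
If you want to try this yourself, a small test pipeline along these lines should do. The stdin and stdout plugins are just a convenient harness, and the file name csv-test.conf used below is hypothetical. One caveat from the csv filter documentation: autodetect_column_names needs the pipeline to run with a single worker, because the header has to be the first line the filter sees.

```
input { stdin {} }

filter {
  csv {
    # The first line seen is consumed as the header and produces no event.
    autodetect_column_names => true
  }
}

output { stdout { codec => rubydebug } }
```

Running it with something like bin/logstash -f csv-test.conf -w 1 and pasting the two lines from the sample file should print one event with the three fields shown above.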

Setting both to true might be useful if you were consuming a file that had an incomplete header line, like this

foo,BAR
1,2,3
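
For that incomplete-header case, a sketch of the filter would look like this. My expectation is that the header supplies foo and BAR, and the third value lands in a generated field called column3, following the same naming as the column2 example above. Treat that as an assumption to verify rather than a guarantee.

```
csv {
     # Take foo and BAR from the header line.
     autodetect_column_names => true
     # Give the unnamed third column a generated name.
     autogenerate_column_names => true
}
```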

(Aralex) #3

This is incredibly helpful, thank you so much, Badger. This could also explain why I've been getting the same data parsed twice. I had strong suspicions about these two settings but wasn't sure of their exact difference.
Thanks again.

