Process - Reshape a CSV file using logstash

Is there a way I can process any CSV file and dynamically detect the data type for each column, either float or string?
As an example, I have this CSV file table,

and I need to dynamically turn it into this schema and then push it to Elasticsearch so I can do some aggregations on it.

I am not sure if I can do it using logstash/ruby or not.

Thanks in advance for any help or suggestions.

If you are willing to use ruby you can make logstash do anything. You can turn it into a C++ compiler with enough ruby code.

Does this really need to dynamically detect the column type or can you hardcode that [size] is a float?
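
(For reference, the hardcoded version would just use the csv filter's convert option; a minimal sketch, assuming the column is called size:)

    csv {
        source => "message"
        autodetect_column_names => true
        convert => { "size" => "float" }
    }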

@Badger
For the dynamic parse, the column name might be anything, and I might have more than one column with a float value. The example I provided was just to demonstrate the idea.
Actually, I don't mind using any method to achieve what I'm looking for.
Since I'm a beginner in logstash/ruby, can you help me or provide guidance on achieving it? I couldn't find clear samples that I can rely on.

    # parse each line, keeping the parsed columns in metadata so they never reach the output directly
    csv { source => "message" target => "[@metadata][row]" autodetect_column_names => true }
    ruby {
        code => '
            r = event.get("[@metadata][row]")
            rn = r["row_number"]
            if rn
                a = []
                # build one { row_number, column_name, column_value_* } hash per column
                r.each { |k, v|
                    h = {}
                    h["row_number"] = rn
                    h["column_name"] = k
                    # treat the value as a float if it looks like a Ruby numeric literal
                    if v =~ /^\s*[+-]?((\d+_?)*\d+(\.(\d+_?)*\d+)?|\.(\d+_?)*\d+)(\s*|([eE][+-]?(\d+_?)*\d+)\s*)$/
                        h["column_value_float"] = v
                        h["column_value_string"] = ""
                    else
                        h["column_value_float"] = ""
                        h["column_value_string"] = v
                    end
                    a << h
                }
                event.set("foo", a)
            end
        '
    }
    # emit one event per entry of [foo], then promote its keys to top-level fields
    split { field => "foo" }
    ruby {
        code => '
            event.get("foo").each { |k, v|
                event.set(k,v)
            }
            event.remove("foo")
        '
    }

@Badger Thanks for the support. This is the input csv file:

row_number,user_name,text,size
1,Mike,Hello,11.5
2,Nicolas,Test Test,0.25
3,Sandy,Test text,1.25

and this is the part I added:

output {
    csv {
        fields => ["column_name", "column_value_string", "column_value_float", "row_number"]
        path   => "/usr/share/output/output.csv"
    }
    stdout {
        codec => rubydebug {
            metadata => true
        }
    }
}

But I got this output; it is missing the headers (column names) and an entire row, no. 3:
[screenshot of the output]

Please do not post pictures of text. Just post the text.

A csv output does not add headers by default. If you tell it to add headers it will add one row of headers for every line it outputs.

Are you sure there is a newline at the end of the third line? Maybe add a blank line to be certain.

I checked the file and found that there was no newline, so I fixed it.

As for telling the csv output to put the headers in the output, I got this:

(Note: I pasted the csv as text, but the image got auto-generated.)

column_name,column_value_float,column_value_string,row_number
user_name,,Mike,1
column_name,column_value_float,column_value_string,row_number
size,11.5,,1
column_name,column_value_float,column_value_string,row_number
text,,Hello,1
column_name,column_value_float,column_value_string,row_number
row_number,1,,1
column_name,column_value_float,column_value_string,row_number
user_name,,Nicolas,2
column_name,column_value_float,column_value_string,row_number
size,0.25,,2
column_name,column_value_float,column_value_string,row_number
text,,Test Test,2
column_name,column_value_float,column_value_string,row_number
row_number,2,,2
column_name,column_value_float,column_value_string,row_number
user_name,,Sandy,3
column_name,column_value_float,column_value_string,row_number
size,1.25,,3
column_name,column_value_float,column_value_string,row_number
text,,Test text,3
column_name,column_value_float,column_value_string,row_number
row_number,3,,3

csv {
    fields => ["column_name", "column_value_float", "column_value_string", "row_number"]
    path   => "/usr/share/output/output.csv"
    csv_options => {
        "write_headers" => true
        "headers" => ["column_name", "column_value_float", "column_value_string", "row_number"]
    }
}

But it is not correct: row_number shows up among the column values, and there is a new row of headers after each row, which is weird!

The csv output calls to_csv for each event. If the to_csv options say to add a header it will do it for every event.

Is there a way to make it get called once, so I end up with a proper csv file?
I used the autogenerate_column_names => true property but it didn't help.

No, I do not think so.

Not even by building the whole csv file using ruby instead?

Yes, if you wrote the file in a ruby filter instead of using an output I guess it could be done, since you could add the header in the init.
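
A minimal sketch of that idea (the path reuses the one from the csv output above, the rest is an assumption, and it only behaves predictably with a single pipeline worker):

    ruby {
        init => '
            # write the header exactly once, when the filter is created (assumed path)
            File.open("/usr/share/output/output.csv", "w") { |f|
                f.puts "column_name,column_value_float,column_value_string,row_number"
            }
        '
        code => '
            # append one csv line per event; naive quoting, assumes the values contain no commas
            line = [ event.get("column_name"), event.get("column_value_float"),
                     event.get("column_value_string"), event.get("row_number") ].join(",")
            File.open("/usr/share/output/output.csv", "a") { |f| f.puts line }
        '
    }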

I wrote this ruby filter to remove a whole row that has a row_number value, but it only removed the cell's value:

ruby {
    code => '
        hash = event.to_hash
        hash.each do |k,v|
            if v == "row_number"
                event.remove(k)
            end
        end
    '
}

This is because I will not have row_number in the input; I need to remove it as a column from the output while still keeping and populating the row's number in the output, even if it is not in the input.
I mean that my input might be:

user_name,text,size
Mike,Hello,11.5
Nicolas,Test Test,0.25
Sandy,Test text,1.25

I used this filter instead, and it didn't work:

csv {
    remove_field => ["row_number"]
}

Why not just remove this line if you do not want row_number in the output?

That would remove it from the output, but I need it in the output even if it is not in the input. Eventually the input will look like this:

user_name,text,size
Mike,Hello,11.5
Nicolas,Test Test,0.25
Sandy,Test text,1.25

So, there should be an auto-numbering for the rows in the output.
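
For example (just a sketch, not something from this thread), a ruby filter placed between the csv filter and the reshaping ruby filter could keep a counter and inject it into [@metadata][row], so the rest of the pipeline still sees a row_number; it assumes pipeline.workers is 1, since the counter lives in a single filter instance, and the header line may need special handling:

    ruby {
        init => '@row_number = 0'
        code => '
            # assign an incrementing row number to each parsed row
            @row_number += 1
            event.set("[@metadata][row][row_number]", @row_number)
        '
    }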

One more thing: would it be possible to have the "headers" after each row removed if I turned the output into JSON instead of csv?

That should work.

Would that require changing the ruby code you provided, or the way I'm processing the file? I don't have enough knowledge in that context.

I thought you meant replacing the csv output with a file output plus a json_lines codec.
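
A sketch of that output section, reusing the earlier path (the elasticsearch host and index name are placeholders):

output {
    file {
        path  => "/usr/share/output/output.json"
        codec => json_lines
    }
    # or push the reshaped events straight into elasticsearch for the aggregations
    elasticsearch {
        hosts => ["http://localhost:9200"]
        index => "csv-reshaped"
    }
}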

The input will stay as a csv file like this:

user_name,text,size
Mike,Hello,11.5
Nicolas,Test Test,0.25
Sandy,Test text,1.25

and the output would be a json file (because I want to push it to elasticsearch to do aggregations on it). The json schema will be something like this (with the row number detected automatically and put in the output):

[
 {
   "column_name": "user_name",
   "column_value_string": "mike",
   "column_value_float": null,
   "row_number": 1
 },
 {
   "column_name": "text",
   "column_value_string": "hello",
   "column_value_float": null,
   "row_number": 1
 },
 {
   "column_name": "size",
   "column_value_string": "",
   "column_value_float": 11.5,
   "row_number": 1
 },
 {
   "column_name": "user_name",
   "column_value_string": "Nicolas",
   "column_value_float": null,
   "row_number": 2
 },
 {
   "column_name": "text",
   "column_value_string": "Test Test",
   "column_value_float": null,
   "row_number": 2
 },
 {
   "column_name": "size",
   "column_value_string": "",
   "column_value_float": 0.25,
   "row_number": 2
 },
 {
   "column_name": "user_name",
   "column_value_string": "Sandy",
   "column_value_float": null,
   "row_number": 3
 },
 {
   "column_name": "text",
   "column_value_string": "Test Text",
   "column_value_float": null,
   "row_number": 3
 },
 {
   "column_name": "size",
   "column_value_string": "",
   "column_value_float": 1.25,
   "row_number": 3
 }
]

@Badger Thanks for the help you offered. I created a new topic that covers a different output for the same input; I'm a little bit lost.