Process - Reshape a CSV file using logstash

Is there a way I can process any CSV file and dynamically detect the data type for each column, either float or string?
As an example, I have this CSV file table,

and I need to dynamically turn it into this schema and then push it to Elasticsearch so I can do some aggregations on it.

I am not sure if I can do it using logstash/ruby or not.

Thanks in advance for any help or suggestions.

If you are willing to use ruby you can make logstash do anything. You can turn it into a C++ compiler with enough ruby code.

Does this really need to dynamically detect the column type or can you hardcode that [size] is a float?
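
(For reference, the hardcoded version would just use the csv filter's convert option; a minimal sketch, assuming the column is called size:)

    csv {
        source => "message"
        autodetect_column_names => true
        convert => { "size" => "float" }
    }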

@Badger
For the dynamic parse, the column name might be anything, and I might have more than one column with a float value. The example I provided was just to demonstrate the idea.
Actually, I don't mind using any method to achieve what I'm looking for.
Since I'm a beginner in logstash/ruby, can you help me or provide guidance on achieving it? I couldn't find clear samples that I can rely on.

    # parse each line, keeping the parsed columns in metadata so they never reach the output directly
    csv { source => "message" target => "[@metadata][row]" autodetect_column_names => true }
    ruby {
        code => '
            r = event.get("[@metadata][row]")
            rn = r["row_number"]
            if rn
                a = []
                # build one { row_number, column_name, column_value_* } hash per column
                r.each { |k, v|
                    h = {}
                    h["row_number"] = rn
                    h["column_name"] = k
                    # treat the value as a float if it looks like a Ruby numeric literal
                    if v =~ /^\s*[+-]?((\d+_?)*\d+(\.(\d+_?)*\d+)?|\.(\d+_?)*\d+)(\s*|([eE][+-]?(\d+_?)*\d+)\s*)$/
                        h["column_value_float"] = v
                        h["column_value_string"] = ""
                    else
                        h["column_value_float"] = ""
                        h["column_value_string"] = v
                    end
                    a << h
                }
                event.set("foo", a)
            end
        '
    }
    # emit one event per entry of [foo], then promote its keys to top-level fields
    split { field => "foo" }
    ruby {
        code => '
            event.get("foo").each { |k, v|
                event.set(k,v)
            }
            event.remove("foo")
        '
    }

@Badger Thanks for the support. This is the input csv file:

row_number,user_name,text,size
1,Mike,Hello,11.5
2,Nicolas,Test Test,0.25
3,Sandy,Test text,1.25

and this is the part I added:

output {
    csv {
        fields => ["column_name", "column_value_string", "column_value_float", "row_number"]
        path   => "/usr/share/output/output.csv"
    }
    stdout {
        codec => rubydebug {
            metadata => true
        }
    }
}

But I got this output; it is missing the headers (column names) and an entire row, no. 3:
[screenshot of the output]

Please do not post pictures of text. Just post the text.

A csv output does not add headers by default. If you tell it to add headers it will add one row of headers for every line it outputs.

Are you sure there is a newline at the end of the third line? Maybe add a blank line to be certain.

I checked the file and found that there was no newline, so I fixed it.

As for telling the csv output to put the headers in the output, I got this:

(Note: I pasted the csv as text, but the image got auto-generated.)

column_name,column_value_float,column_value_string,row_number
user_name,,Mike,1
column_name,column_value_float,column_value_string,row_number
size,11.5,,1
column_name,column_value_float,column_value_string,row_number
text,,Hello,1
column_name,column_value_float,column_value_string,row_number
row_number,1,,1
column_name,column_value_float,column_value_string,row_number
user_name,,Nicolas,2
column_name,column_value_float,column_value_string,row_number
size,0.25,,2
column_name,column_value_float,column_value_string,row_number
text,,Test Test,2
column_name,column_value_float,column_value_string,row_number
row_number,2,,2
column_name,column_value_float,column_value_string,row_number
user_name,,Sandy,3
column_name,column_value_float,column_value_string,row_number
size,1.25,,3
column_name,column_value_float,column_value_string,row_number
text,,Test text,3
column_name,column_value_float,column_value_string,row_number
row_number,3,,3

csv {
    fields => ["column_name", "column_value_float", "column_value_string", "row_number"]
    path   => "/usr/share/output/output.csv"
    csv_options => {
        "write_headers" => true
        "headers" => ["column_name", "column_value_float", "column_value_string", "row_number"]
    }
}

But it is not correct: row_number shows up among the column values, and there is a new row of headers after each row, which is weird!

The csv output calls to_csv for each event. If the to_csv options say to add a header it will do it for every event.

Is there a way to make it get called once, so I end up with a proper csv file?
I used the autogenerate_column_names => true property but it didn't help.

No, I do not think so.

Not even by building the whole csv file using ruby instead?

Yes, if you wrote the file in a ruby filter instead of using an output I guess it could be done, since you could add the header in the init.
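
A minimal sketch of that idea (the path reuses the one from the csv output above, the rest is an assumption, and it only behaves predictably with a single pipeline worker):

    ruby {
        init => '
            # write the header exactly once, when the filter is created (assumed path)
            File.open("/usr/share/output/output.csv", "w") { |f|
                f.puts "column_name,column_value_float,column_value_string,row_number"
            }
        '
        code => '
            # append one csv line per event; naive quoting, assumes the values contain no commas
            line = [ event.get("column_name"), event.get("column_value_float"),
                     event.get("column_value_string"), event.get("row_number") ].join(",")
            File.open("/usr/share/output/output.csv", "a") { |f| f.puts line }
        '
    }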

I wrote this ruby filter to remove a whole row that has a row_number value, but it only removed the cell's value:

ruby {
    code => '
        hash = event.to_hash
        hash.each do |k,v|
            if v == "row_number"
                event.remove(k)
            end
        end
    '
}

This is because I will not have row_number in the input; I need to remove it as a column from the output while still keeping and populating the row's number in the output, even if it is not in the input.
I mean that my input might be:

user_name,text,size
Mike,Hello,11.5
Nicolas,Test Test,0.25
Sandy,Test text,1.25

I used this filter instead, and it didn't work:

csv {
    remove_field => ["row_number"]
}

Why not just remove this line if you do not want row_number in the output?

That would remove it from the output, but I need it in the output even if it is not in the input. Eventually the input will look like this:

user_name,text,size
Mike,Hello,11.5
Nicolas,Test Test,0.25
Sandy,Test text,1.25

So, there should be an auto-numbering for the rows in the output.
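
For example (just a sketch, not something from this thread), a ruby filter placed between the csv filter and the reshaping ruby filter could keep a counter and inject it into [@metadata][row], so the rest of the pipeline still sees a row_number; it assumes pipeline.workers is 1, since the counter lives in a single filter instance, and the header line may need special handling:

    ruby {
        init => '@row_number = 0'
        code => '
            # assign an incrementing row number to each parsed row
            @row_number += 1
            event.set("[@metadata][row][row_number]", @row_number)
        '
    }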

One more thing: would it be possible to have the "headers" after each row removed if I turned the output into JSON instead of csv?

That should work.

Would that require changing the ruby code you provided, or the way I'm processing the file? I don't have enough knowledge in that context.

I thought you meant replacing the csv output with a file output plus a json_lines codec.
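
A sketch of that output section, reusing the earlier path (the elasticsearch host and index name are placeholders):

output {
    file {
        path  => "/usr/share/output/output.json"
        codec => json_lines
    }
    # or push the reshaped events straight into elasticsearch for the aggregations
    elasticsearch {
        hosts => ["http://localhost:9200"]
        index => "csv-reshaped"
    }
}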

The input will stay as a csv file like this:

user_name,text,size
Mike,Hello,11.5
Nicolas,Test Test,0.25
Sandy,Test text,1.25

and the output would be a json file (because I want to push it to elasticsearch to do aggregations on it). The json schema will be something like this (with the row number detected automatically and put in the output):

[
 {
   "column_name": "user_name",
   "column_value_string": "mike",
   "column_value_float": null,
   "row_number": 1
 },
 {
   "column_name": "text",
   "column_value_string": "hello",
   "column_value_float": null,
   "row_number": 1
 },
 {
   "column_name": "size",
   "column_value_string": "",
   "column_value_float": 11.5,
   "row_number": 1
 },
 {
   "column_name": "user_name",
   "column_value_string": "Nicolas",
   "column_value_float": null,
   "row_number": 2
 },
 {
   "column_name": "text",
   "column_value_string": "Test Test",
   "column_value_float": null,
   "row_number": 2
 },
 {
   "column_name": "size",
   "column_value_string": "",
   "column_value_float": 0.25,
   "row_number": 2
 },
 {
   "column_name": "user_name",
   "column_value_string": "Sandy",
   "column_value_float": null,
   "row_number": 3
 },
 {
   "column_name": "text",
   "column_value_string": "Test Text",
   "column_value_float": null,
   "row_number": 3
 },
 {
   "column_name": "size",
   "column_value_string": "",
   "column_value_float": 1.25,
   "row_number": 3
 }
]

@Badger Thanks for the help you offered. I created a new topic that covers a different output for the same input; I'm a little bit lost.