Merge 3 CSV files based on related column


(slimshady) #1

Hi,
I am trying to merge 3 csv files which have columns like:
1st CSV: [oid, symbol, price, qty, side, err1]
2nd CSV: [oid, tid, symbol, price, qty, side, err2, error_msg]
3rd CSV: [tid, symbol, price, qty, side, err3]

Now as you can see, i have to map oid of 1st csv with tid of 2nd csv and also tid of 3rd csv with oid of 2nd csv.

I think it is very similar to outer join queries in relational databases.

So what approach should I follow for such kind of relationships between log files?

PS: I have already read many FAQs related to this and tried using Aggregation, Parent-child relationship, but no help :frowning:


(Magnus B├Ąck) #2

Logstash isn't the right tool for that task.


(slimshady) #3

@magnusbaeck Thanks for replying!

So can you please give me some insight about the tools that will be required? Or can you give me the link of some blog related to similar use cases.


(slimshady) #4

I implemented the same using ruby filter plugin.
My logstash config is something like this :-

input
{
file
{
path => "/usr/share/logstash/bin/csvt1/*.csv"
start_position => "beginning"
sincedb_path => "/dev/null"
}
}

filter
{

if "Book0" in [path] {
mutate {
replace => {
"type" => "Book0"
}
}
}

if "Book1" in [path] {
mutate {
replace => {
"type" => "Book1"
}
}
}

if "Book2" in [path] {
mutate {
replace => {
"type" => "Book2"
}
}
}

if [type] == "Book0"
{
csv
{
autodetect_column_names => true
separator => ","
convert => {
"timestamp" => "date_time"
"price" => "float"
"quantity" => "integer"
"brokerid" => "integer"
"clientid" => "integer"
"orderid" => "integer"
"tradeid" => "integer"
}
}

    date{
            match => ["timestamp" , "dd-MM-yyyy HH:mm:ss"]
            timezone => "Asia/Kolkata"
    }

}

if [type] == "Book1"
{
csv
{
autodetect_column_names => true
separator => ","
convert => {
"timestamp" => "date_time"
"price" => "float"
"quantity" => "integer"
"brokerid" => "integer"
"clientid" => "integer"
"orderid" => "integer"
}
}

date{
	match => ["timestamp" , "dd-MM-yyyy HH:mm:ss"]
	timezone => "Asia/Kolkata"	
}

}

if [type] == "Book2"
{
csv
{
autodetect_column_names => true
separator => ","
convert => {
"timestamp" => "date_time"
"price" => "float"
"quantity" => "integer"
"brokerid" => "integer"
"clientid" => "integer"
"tradeid" => "integer"
}
}

    date{
            match => ["timestamp" , "dd-MM-yyyy HH:mm:ss"]
            timezone => "Asia/Kolkata"
    }

}

ruby
{
init => "
@@map_to = {};
@@map_ot = {};
"
code => '
if event.get("tradeid") and event.get("orderid")
@@map_to[event.get("tradeid")] = event.get("orderid");
@@map_ot[event.get("orderid")] = event.get("tradeid");
elsif event.get("tradeid")
event.set("orderid" , @@map_to.fetch(event.get("tradeid") , nil));
elsif event.get("orderid")
event.set("tradeid" , @@map_ot.fetch(event.get("orderid"), nil));
end
'
}

}

output
{
stdout
{
codec => rubydebug
}
}

But now i am getting nil for most the values. I thought the problem is occurring due to the multiple pipeline workers, so i tried running logstash with following command

./logstash -f file.conf -w 1

But even then also I am getting nil values.
Thanks in advance!


(system) #5

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.