Hello,
I am using this pipeline to enrich my SIEM with URLhaus information:
input {
  exec {
    command => 'curl https://urlhaus.abuse.ch/downloads/csv/ --output text.zip && unzip -c text.zip'
    interval => 86400
    type => 'iphaus'
    codec => line
  }
}
filter {
  if [type] == "iphaus" {
    csv {
      columns => ["id","dateadded","url","url_status","threat","tags","urlhaus_link","reporter"]
      separator => ","
    }
    mutate {
      remove_field => ["message"]
    }
  }
}
output {
  elasticsearch {
    hosts => ["https://X.X.X.X:9200"]
    index => "malware-%{+YYYY.MM.dd}"
    cacert => 'ca.crt'
    user => "elastic"
    password => "password"
  }
}
The only problem is that, because the list is downloaded every 24 hours, I get duplicates in my index (the same URL is present in both yesterday's and today's download).
I would like to know how I can match each URL against the url field in my index to check whether it is already present, and index it only if it is not.
Thanks for your help.
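
One possible approach to the deduplication described above, offered only as a minimal sketch and not a tested solution: instead of querying the index for each URL, derive the Elasticsearch document ID from the url field, so that a URL seen again maps to the same document. The sketch assumes the fingerprint filter plugin is available, that the fingerprint block runs after the csv filter so the url field already exists, and that a single index is used in place of the daily one (the name "malware" is an assumption), since identical IDs in different daily indices would still produce duplicates.

```
filter {
  if [type] == "iphaus" {
    # Hash the url field; @metadata fields are not sent to Elasticsearch.
    fingerprint {
      source => ["url"]
      target => "[@metadata][fingerprint]"
      method => "SHA256"
    }
  }
}
output {
  elasticsearch {
    hosts => ["https://X.X.X.X:9200"]
    # Assumption: one fixed index, so the same ID always targets the same document.
    index => "malware"
    # Same URL -> same hash -> same document ID.
    document_id => "%{[@metadata][fingerprint]}"
    # "create" rejects a document whose ID already exists;
    # "index" would overwrite the existing document instead.
    action => "create"
    cacert => 'ca.crt'
    user => "elastic"
    password => "password"
  }
}
```

With action => "create", a URL that is already indexed should be rejected with a version conflict, which Logstash logs rather than indexing again. If fields such as url_status should be refreshed on each download, action => "index" with the same document_id would overwrite the existing document.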