[Threat Intelligence]: Avoid redundancy of information in the same index

TheHunter1 · January 6, 2021, 9:04am

Hello,

I am using this pipeline to enrich my SIEM with URLhaus information:

input {
  exec {
    command => 'curl https://urlhaus.abuse.ch/downloads/csv/ --output text.zip && unzip -c text.zip'
    interval => 86400
    type => 'iphaus'
    codec => line
  }
}
filter {
  if [type] == "iphaus" {
    csv {
      columns => ["id","dateadded","url","url_status","threat","tags","urlhaus_link","reporter"]
      separator => ","
    }
    mutate {
      remove_field => ["message"]
    }
  }
}

output {
  elasticsearch {
    hosts => ["https://X.X.X.X:9200"]
    index => "malware-%{+YYYY.MM.dd}"
    cacert => 'ca.crt'
    user => "elastic"
    password => "password"
  }
}

The only problem is that, because I download the list every 24 hours, I get duplicates in my index (the same URL appears in both yesterday's data and today's).

I would like to know how I can compare each incoming url against the url field in my index to check whether it is already present, and index it only if it is not.

Thanks for your help.

Badger · January 6, 2021, 4:31pm

If you set the document_id on the elasticsearch output based on some field, or a fingerprint generated from multiple fields of the document, then the elasticsearch output will overwrite the document instead of inserting a duplicate.

TheHunter1 · January 6, 2021, 4:37pm

Thanks for your answer @Badger,
Could you please tell me how I can set that in my Logstash configuration?

Badger · January 6, 2021, 4:50pm

To generate a fingerprint, use a fingerprint filter:

fingerprint {
    concatenate_sources => true 
    method => "SHA256" 
    source => [ "url" ] # And possibly other fields
    target => "[@metadata][fingerprint]"
}

Then reference it in the elasticsearch output using a sprintf reference:

document_id => "%{[@metadata][fingerprint]}"
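
Put together with the pipeline from the first post, the two pieces might look like the sketch below. This is an untested illustration, not a verified configuration: the host, certificate path, and credentials are the placeholders from the original post, and the fixed index name `malware` is an assumption made for this example. The fixed name matters because a document ID is only unique within a single index. With a date-based name such as malware-%{+YYYY.MM.dd}, events fetched on different days are written to different daily indices, so the same URL could still appear once per day.

```
filter {
  if [type] == "iphaus" {
    csv {
      columns => ["id","dateadded","url","url_status","threat","tags","urlhaus_link","reporter"]
      separator => ","
    }
    # Hash the url field; the same URL always yields the same fingerprint.
    # Fields under [@metadata] are not sent to Elasticsearch.
    fingerprint {
      concatenate_sources => true
      method => "SHA256"
      source => [ "url" ]
      target => "[@metadata][fingerprint]"
    }
    mutate {
      remove_field => ["message"]
    }
  }
}

output {
  elasticsearch {
    hosts => ["https://X.X.X.X:9200"]
    # Assumption: one fixed index, so a repeated URL overwrites the
    # existing document instead of landing in a new daily index.
    index => "malware"
    document_id => "%{[@metadata][fingerprint]}"
    cacert => 'ca.crt'
    user => "elastic"
    password => "password"
  }
}
```

With this layout, re-ingesting a URL that is already indexed replaces the stored document with the latest row for that URL, so fields such as url_status reflect the most recent download.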

TheHunter1 · January 7, 2021, 8:11am

Thank you very much @Badger,

I just tried it and it's working like a charm

