How can I count duplicate entries when deduplicating with fingerprint and report the count in some form (such as a log)?

WeirdorPersist · November 5, 2024, 9:02am

I am using the fingerprint plugin to deduplicate data. I would like to know how many duplicates each entry has. What should I do? This is my configuration file.

input {
  beats {
    port => 5044
  }
}
filter {
  grok {
    match => {
      # "message" => "%{TIMESTAMP_ISO8601:timestamp} %{HOSTNAME:host} %{DATA:process_name}(?:\[%{NUMBER:pid}\])?: %{GREEDYDATA:log_message}"
      message => "(?:%{TIMESTAMP_ISO8601:timestamp})? (?:%{DATA:hostname})? (?:%{DATA:process_name})?(?:\[%{NUMBER:pid}\])?:(?:%{GREEDYDATA:log_message})?"
    }
  }
  # Mark missing fields
  if ![timestamp] {
    mutate {
      add_field => { "timestamp" => "N/A" }
    }
    mutate {
      add_tag => ["missing_timestamp"]
    }
  }
  if ![hostname] {
    mutate {
      add_field => { "hostname" => "N/A" }
    }
    mutate {
      add_tag => ["missing_hostname"]
    }
  }
  if ![process_name] {
    mutate {
      add_field => { "process_name" => "N/A" }
    }
    mutate {
      add_tag => ["missing_process_name"]
    }
  }
  if ![pid] {
    mutate {
      # Give the missing pid a default value
      add_field => { "pid" => "N/A" }
    }
    mutate {
      # Add a tag indicating that this log entry has no pid
      add_tag => ["missing_pid"]
    }
  }
  if ![log_message] {
    mutate {
      add_field => { "log_message" => "N/A" }
    }
    mutate {
      add_tag => ["missing_log_message"]
    }
  }
  mutate {
    remove_field => ["event", "log", "@version", "@timestamp", "message", "host"]
  }
  fingerprint {
    source => ["hostname", "process_name", "pid", "log_message"]
    target => "[@metadata][generated_id]"
    method => "SHA256"
    concatenate_sources => true
  }
}
output {
  elasticsearch {
    hosts => ["http://192.168.52.130:9200"]
    # index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
    index => "%{[fields][node]}-%{+YYYY.MM.dd}"
    document_id => "%{[@metadata][generated_id]}"
    #user => "elastic"
    #password => "changeme"
  }
}

Why is the date-based index name I configured not taking effect?

index => "%{[fields][node]}-%{+YYYY.MM.dd}"

The log format is as follows.

2024-10-15T00:00:03.172528+08:00 yp-VMware-Virtual-Platform [1]: rsyslog.service: Sent signal SIGHUP to main process 1346 (rsyslogd) on client request.

Is the following a good approach, or do I need to handle deduplication some other way?

filter {
  # Initialize the fingerprints field as an empty array
  mutate {
    add_field => { "fingerprints" => [] }
  }

  # Assume the other filter configuration is already in place
  mutate {
    add_field => {
      "[duplicate_count]" => "%{[@metadata][total_duplicates]}" # initial value set to 0
    }
  }

  # Generate the fingerprint
  fingerprint {
    source => ["hostname", "process_name", "pid", "log_message"]
    target => "[@metadata][generated_id]"
    method => "SHA256"
    concatenate_sources => true
  }

  # Check whether this is a duplicate event
  if "[@metadata][generated_id]" in [fingerprints] {
    # If it is a duplicate, increment duplicate_count
    mutate {
      add_field => {
        "[duplicate_count]" => "%{[@metadata][total_duplicates]} + 1"
      }
    }
    mutate {
      remove_field => ["[@metadata][total_duplicates]"]
    }
  } else {
    # If it is a new event, add generated_id to the fingerprints array
    mutate {
      push => { "fingerprints" => "%{[@metadata][generated_id]}" }
    }
  }
}

Badger · November 5, 2024, 11:32am

The index option on an elasticsearch output is sometimes ignored. That will be true if you are using data streams, and, if I recall correctly, if ILM is resetting it.

There is no mutate+push operation.

I'm not convinced trying to do this in logstash is the right approach. You need a database of fingerprints you have already seen. Should that be pre-populated by querying elasticsearch? If not, is it OK that it gets reset every time logstash restarts? Or do you need to manage persistence yourself?
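To make the "database of fingerprints" point concrete, here is a minimal Python sketch of what such a store would have to do. It is not what the fingerprint plugin does internally: the `fingerprint` function, the `DuplicateCounter` class, and the way the fields are joined are all assumptions for illustration, so the hashes will not necessarily match the ones Logstash generates. The counter lives in memory only, so it starts from zero on every restart, which is exactly the persistence question raised above.

```python
import hashlib
import logging
from collections import Counter

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("dedup")

# Same source fields as the fingerprint filter in the question.
FIELDS = ("hostname", "process_name", "pid", "log_message")


def fingerprint(event: dict) -> str:
    """SHA-256 over the joined fields (illustrative; not the plugin's exact format)."""
    joined = "|".join(f"{name}|{event.get(name, 'N/A')}" for name in FIELDS)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


class DuplicateCounter:
    """In-memory store of fingerprints already seen; lost when the process exits."""

    def __init__(self) -> None:
        self.seen = Counter()

    def observe(self, event: dict) -> int:
        """Record the event and return how many earlier events shared its fingerprint."""
        fp = fingerprint(event)
        earlier = self.seen[fp]
        self.seen[fp] += 1
        if earlier:
            # Feedback in log form, as asked for in the question.
            log.info("duplicate %d of fingerprint %s", earlier, fp[:12])
        return earlier
```

Calling `observe` for every event gives a running duplicate count per fingerprint, but the store would have to be persisted, or rebuilt from Elasticsearch, to survive a restart.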

elasticsearch already handles this by updating the _version field every time a document is updated. If logstash is the only thing updating documents you can detect duplicates by looking at documents where _version is not 1.
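As a rough illustration of the `_version` idea, the Python sketch below builds a search body with `"version": true` (which asks Elasticsearch to return `_version` with each hit) and turns a response into a per-document duplicate count. The function names and the sample response are made up for the example, and the count is only meaningful under the assumption stated above, that nothing other than Logstash writes to these documents.

```python
from typing import Any


def search_body_with_versions(size: int = 100) -> dict[str, Any]:
    """Search request body that asks for _version on every hit."""
    return {"version": True, "size": size, "query": {"match_all": {}}}


def duplicate_report(response: dict[str, Any]) -> dict[str, int]:
    """Map document _id to the number of times it was overwritten (_version - 1)."""
    report: dict[str, int] = {}
    for hit in response.get("hits", {}).get("hits", []):
        version = hit.get("_version", 1)
        if version > 1:
            report[hit["_id"]] = version - 1
    return report


# Hypothetical response, trimmed to the fields the report needs.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "a1b2", "_version": 1},
            {"_id": "c3d4", "_version": 4},
        ]
    }
}

for doc_id, duplicates in duplicate_report(sample_response).items():
    print(f"{doc_id}: {duplicates} duplicate(s)")
```

The filtering happens on the client side after the search returns; for a large index the search would need to be paginated.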

WeirdorPersist · November 5, 2024, 12:21pm

Thank you for answering my question. How can I view the _version field in Kibana?
