I want to remove duplicate events based on a particular field of my input.
I wrote the following logic, but I got an error:
aggregate {
  task_id => "%{[meta][ingestionHash]}"
  code => "
    map['@metadata']['keep'] ||= event.get('[@metadata][first_event]') ? false : true
    event.set('[@metadata][first_event]', true)
  "
  end_of_task => true
}
My input has a field named "ingestionHash", and I want to remove duplicate events based on it. If two events have the same ingestionHash value, I want to ignore the second one.
aggregate {
  task_id => "%{[meta][ingestionHash]}"
  code => "
    map['@metadata']['keep'] ||= event.get('[@metadata][first_event]') ? false : true
    event.set('[@metadata][first_event]', true)
  "
  end_of_task => true
}
if ![[@metadata][keep]] {
  drop { }
}
This is my code for removing duplicate events. Is it correct?
Why not use this field as the document_id in your elasticsearch output?
The deduplication needs to be done in Elasticsearch not Logstash.
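To make the suggestion above concrete, here is a minimal sketch of an output that uses the hash as the document ID. The hosts value and the index name are placeholders assumed for illustration, not taken from your setup; only the [meta][ingestionHash] field comes from your post.

```
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]         # assumption: replace with your cluster
    index => "my-index"                        # assumption: hypothetical index name
    # Events with the same ingestionHash get the same _id
    document_id => "%{[meta][ingestionHash]}"
  }
}
```

With the default index action, a second event with the same ID overwrites the first document instead of creating a new one. Setting action => "create" should make Elasticsearch reject the second event instead. Either way this only deduplicates within a single index, so with date-based index names two duplicates that land in different indices would both be kept.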
I do not use the aggregate filter, so I cannot tell whether your code is right, but even if it is, I do not think it will work the way you want.
The aggregate filter has a timeout, a time window in which it will aggregate events. If you receive two events with the same ingestionHash value but they arrive outside this window, the aggregate filter will do nothing.
For example, if you receive an event with the ingestionHash value abcd-1234 now and receive the same ingestionHash 15 minutes later, they will not be aggregated because the timeout for the aggregate filter will already have expired.
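For illustration, this is roughly how a time-bounded deduplication could look with the aggregate filter. Treat it as an untested sketch under assumptions: the 'seen' key and the 600 second timeout are arbitrary choices, and it drops the task-ending end_of_task setting so that the map survives until the timeout.

```
filter {
  aggregate {
    task_id => "%{[meta][ingestionHash]}"
    code => "
      if map['seen']
        # A map already exists for this hash, so this is a repeat: drop it
        event.cancel
      else
        # First event for this hash inside the window: remember it
        map['seen'] = true
      end
    "
    # Assumption: forget a hash 600 seconds after its first event
    timeout => 600
  }
}
```

A duplicate that arrives after the timeout has expired is treated as a first event again, which is the limitation described above. The aggregate filter also expects a single pipeline worker (for example, running Logstash with -w 1), otherwise events with the same hash may be processed out of order.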
Instead of using document_id, is there another way to remove duplicate events based on a field value? To create a document_id I have to use the fingerprint filter plugin.
The following is my config file:
input {
}
filter {
  mutate {
    split => ["topicParts", "."]
    add_field => { "dataType" => "%{[topicParts][2]}" }
    add_field => { "ingestKeyHash" => "" }
  }
  fingerprint {
    source => ["[meta][recordKeys]"]
    method => "MURMUR3"
    target => "ingestKeyHash"
  }
}
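Since the config above already computes a fingerprint, one possible way to combine it with the document_id suggestion is sketched below. This is an assumption-laden example, not a tested config: the hosts value and index name are placeholders, and writing the hash to [@metadata][fingerprint] instead of ingestKeyHash is a choice made here so the hash is not stored in the document itself.

```
filter {
  fingerprint {
    source => ["[meta][recordKeys]"]
    method => "MURMUR3"
    # @metadata fields are available to the output but are not indexed
    target => "[@metadata][fingerprint]"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]         # assumption: replace with your cluster
    index => "my-index"                        # assumption: hypothetical index name
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```

MURMUR3 produces a short hash, so unrelated records may collide and overwrite each other; a longer method such as SHA256 is a safer choice when the fingerprint is used as a document ID. If the destination is OpenSearch rather than Elasticsearch, the output plugin name will differ and its options should be checked against that plugin's documentation.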
OpenSearch/OpenDistro are AWS run products and differ from the original Elasticsearch and Kibana products that Elastic builds and maintains. You may need to contact them directly for further assistance.
(This is an automated response from your friendly Elastic bot. Please report this post if you have any suggestions or concerns )