I have duplicate records in my indices. How can I find a list of the duplicate records?
The duplicate records have the same offset. Can you suggest a query that lists the offsets that occur more than once?
Or is there any other way to find the duplicate records?
What do you mean by offset?
The log.offset field in Kibana. I see that a few records are duplicated in Kibana with the same log.offset value.
Setup in brief: Filebeat -> 2 Logstash -> Elasticsearch
# Filebeat output
output.logstash:
  hosts: ["host1:5044", "host2:5044"]
  loadbalance: true
This happens only for a few records, around 100-500 in 1 million. How can I fix the existing duplicates and avoid new ones?
You can use the fingerprint filter in Logstash to create your own Elasticsearch document _id based on the timestamp and the offset. This means a duplicate will update the original document instead of creating a new one.
Why is there a duplication issue in the first place? And how can I find the existing duplicate data? Can you suggest a query to find "log.offset" values with a count > 1?
Logstash tries to deliver every event at least once, so in some cases it may duplicate data; this behavior is expected.
To avoid duplicate events in Elasticsearch, you need to use a custom _id instead of letting Elasticsearch choose the value of the _id field.
You can use an existing ID from your documents, if there is one, or create one based on one or more fields of the documents using the fingerprint filter.
Check this blog post and this blog post for tips on how to deal with duplicates in Logstash and Elasticsearch.
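To illustrate why a deterministic _id removes duplicates, here is a minimal Python sketch. It is not Logstash code and does not reproduce the fingerprint filter's exact hashing; it only shows the idea. The field names @timestamp and log.offset follow the suggestion above, while the helper name fingerprint_id, the sample events, and the dict standing in for an index are all made up for the example.

```python
import hashlib


def fingerprint_id(event, fields=("@timestamp", "log.offset")):
    """Build a deterministic document ID from selected event fields (hypothetical helper)."""
    # Pair each field name with its value so the same event always hashes the same way.
    parts = [f"{name}={event[name]}" for name in fields]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


# A plain dict stands in for an index keyed by _id (assumption for illustration).
index = {}

# Made-up sample events; the second is a redelivered copy of the first.
events = [
    {"@timestamp": "2020-01-01T10:00:00.000Z", "log.offset": 1024, "message": "started"},
    {"@timestamp": "2020-01-01T10:00:00.000Z", "log.offset": 1024, "message": "started"},
    {"@timestamp": "2020-01-01T10:00:01.000Z", "log.offset": 2048, "message": "ready"},
]

for event in events:
    # Writing under the same ID overwrites the earlier copy instead of adding a new one.
    index[fingerprint_id(event)] = event

print(len(index))
```

Three events go in, but only two entries remain, because the redelivered copy hashes to the same ID as the original. If several files or hosts feed the same index, the timestamp and offset alone may not be unique, so it is probably worth adding fields such as the file path or host name to the hash.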
To find the duplicate events you need to run an aggregation query.
Something like this:
GET your-index/_search
{
  "size": 0,
  "aggs": {
    "duplicates": {
      "terms": {
        "field": "log.offset",
        "min_doc_count": 2
      }
    }
  }
}
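The response to a query like this lists one bucket per offset, each with a key and a doc_count. As a rough sketch of how those buckets could be turned into a list of duplicated offsets, the Python below parses a response of that shape. The function name duplicate_offsets and the sample response are invented for the example; only the aggregations / duplicates / buckets layout comes from the query above.

```python
def duplicate_offsets(response, agg_name="duplicates"):
    """Map each duplicated offset to its document count (hypothetical helper)."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Made-up response in the shape a terms aggregation returns.
sample_response = {
    "aggregations": {
        "duplicates": {
            "buckets": [
                {"key": 1024, "doc_count": 2},
                {"key": 5120, "doc_count": 3},
            ]
        }
    }
}

print(duplicate_offsets(sample_response))
```

Two things to keep in mind. A terms aggregation returns only the top 10 buckets by default, so with a few hundred duplicates you would likely need to set a larger "size" inside the "terms" block. Also, log.offset is a position within a single file, so the same offset in two different files or on two different hosts is not necessarily a duplicate.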