Hi friends,
I am facing a serious issue: duplicate documents appear in ES after indexing.
Log format:
//
10.8.1.90,23-11-2019 00:00:07,CURRENT,Broadband,Idea/Vodafone,Failure
10.8.1.90,23-11-2019 01:30:08,CURRENT,Broadband,Idea/Vodafone,Failure
10.8.1.80,23-11-2019 00:00:09,CURRENT,Broadband,Idea/Vodafone,Success
10.8.1.80,23-11-2019 01:30:08,CURRENT,Idea/Vodafone,Broadband,Success
10.8.1.74,23-11-2019 10:06:02,CURRENT,Idea/Vodafone,Broadband,Failure
10.8.1.74,23-11-2019 10:07:02,CURRENT,Idea/Vodafone,Broadband,Failure
10.8.1.74,23-11-2019 10:08:02,CURRENT,Idea/Vodafone,Broadband,Failure
//
I am using the following Logstash configuration file for indexing:
//
input {
  file {
    path => "/home/logs/connection.csv"
    max_open_files => 17000
    start_position => "beginning"
    sincedb_path => "/home/ALL/since_connection_november.db"
  }
}
filter {
  # 10.8.0.68,10-11-2019 12:00:12,CURRENT,Broadband,Idea/Vodafone,Success
  # 10.8.0.68,10-11-2019 13:30:14,CURRENT,Idea/Vodafone,Broadband,Failure
  dissect {
    mapping => {
      "message" => "%{ip},%{happened_instant},%{time_slot},%{provider_from},%{provider_to},%{status}"
    }
  }
  date {
    match => [ "happened_instant", "dd-MM-yyyy HH:mm:ss" ]
    target => "failover_date"
  }
}
output {
  elasticsearch {
    hosts => "server_domain:9292"
    index => "connection_november"
  }
  stdout {}
}
//
I am using failover_date, the target of the date filter, as the timestamp while indexing.
My file has about 2744744 lines, but after indexing, Discover shows 8745544 hits, i.e. more than 3 times as many.
When I checked for duplicates, I found that each document had been inserted multiple times, differing only in @timestamp, i.e. the same line was indexed several times at different times.
Below are two identical docs with different timestamps, which I found by expanding the docs in the Discover tab:
First doc:
//
{
  "_index": "connection_november",
  "_type": "_doc",
  "_id": "aEEdn24BNAK6JMz4dK_F",
  "_version": 1,
  "_score": null,
  "_source": {
    "ip": "10.8.1.90",
    "failover_date": "2019-11-22T18:30:07.000Z",
    "host": "swupdate",
    "provider_from": "Broadband",
    "message": "10.8.1.90,23-11-2019 00:00:07,CURRENT,Broadband,Idea/Vodafone,Failure",
    "@timestamp": "2019-11-24T20:32:17.903Z",
    "@version": "1",
    "status": "Failure",
    "happened_instant": "23-11-2019 00:00:07",
    "path": "/home/logs/connection.csv",
    "time_slot": "CURRENT",
    "provider_to": "Idea/Vodafone"
  },
  "fields": {
    "failover_date": [
      "2019-11-22T18:30:07.000Z"
    ],
    "@timestamp": [
      "2019-11-24T20:32:17.903Z"
    ]
  },
  "sort": [
    1574627537903
  ]
}
//
Second doc:
//
{
  "_index": "connection_november",
  "_type": "_doc",
  "_id": "2vH3mW4BNAK6JMz4Btpt",
  "_version": 1,
  "_score": null,
  "_source": {
    "ip": "10.8.1.90",
    "failover_date": "2019-11-22T18:30:07.000Z",
    "host": "swupdate",
    "provider_from": "Broadband",
    "message": "10.8.1.90,23-11-2019 00:00:07,CURRENT,Broadband,Idea/Vodafone,Failure",
    "@timestamp": "2019-11-23T20:32:13.228Z",
    "@version": "1",
    "status": "Failure",
    "happened_instant": "23-11-2019 00:00:07",
    "path": "/home/logs/connection.csv",
    "time_slot": "CURRENT",
    "provider_to": "Idea/Vodafone"
  },
  "fields": {
    "failover_date": [
      "2019-11-22T18:30:07.000Z"
    ],
    "@timestamp": [
      "2019-11-23T20:32:13.228Z"
    ]
  },
  "sort": [
    1574541133228
  ]
}
//
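Please note that the two docs have the same content and differ only in _id and @timestamp (and the sort value derived from it); their @timestamp values are roughly 24 hours apart.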
Can anyone suggest why this issue occurs, or suggest a solution?
Please note that I am new to ES.
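One mitigation sometimes used for this symptom, sketched here and not verified against this setup, is to derive the Elasticsearch document ID from the line content, so that re-reading the same line overwrites the existing document instead of creating a new one. The sketch assumes the fingerprint filter plugin is available in this Logstash install; the `[@metadata][fingerprint]` field name is an arbitrary choice for illustration, and older versions of the plugin may also require a `key` option for SHA methods. It does not explain why the file is read more than once, and genuinely repeated lines in the CSV would collapse into a single document.

```
filter {
  # Hash the raw line; identical lines always produce the same value.
  # The target field lives under @metadata, so it is not indexed.
  fingerprint {
    source => "message"
    target => "[@metadata][fingerprint]"
    method => "SHA256"
  }
}
output {
  elasticsearch {
    hosts => "server_domain:9292"
    index => "connection_november"
    # Use the hash as _id so a re-read updates the document rather than duplicating it.
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```

The `fingerprint` block would go inside the existing `filter` section, after `dissect` and `date`, and `document_id` would be added to the existing `elasticsearch` output. Documents already indexed with auto-generated IDs would remain until the index is rebuilt.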