Duplicate documents found in ES after indexing

Hi friends,
I am facing a serious issue: duplicate documents are appearing in ES after indexing...

Log format:
//
10.8.1.90,23-11-2019 00:00:07,CURRENT,Broadband,Idea/Vodafone,Failure
10.8.1.90,23-11-2019 01:30:08,CURRENT,Broadband,Idea/Vodafone,Failure
10.8.1.80,23-11-2019 00:00:09,CURRENT,Broadband,Idea/Vodafone,Success
10.8.1.80,23-11-2019 01:30:08,CURRENT,Idea/Vodafone,Broadband,Success
10.8.1.74,23-11-2019 10:06:02,CURRENT,Idea/Vodafone,Broadband,Failure
10.8.1.74,23-11-2019 10:07:02,CURRENT,Idea/Vodafone,Broadband,Failure
10.8.1.74,23-11-2019 10:08:02,CURRENT,Idea/Vodafone,Broadband,Failure
//

I am using the following Logstash configuration file for indexing:

//
input {
  file {
    path => "/home/logs/connection.csv"
    max_open_files => 17000
    start_position => "beginning"
    sincedb_path => "/home/ALL/since_connection_november.db"
  }
}
filter {
  # Sample lines:
  # 10.8.0.68,10-11-2019 12:00:12,CURRENT,Broadband,Idea/Vodafone,Success
  # 10.8.0.68,10-11-2019 13:30:14,CURRENT,Idea/Vodafone,Broadband,Failure
  dissect {
    mapping => {
      "message" => "%{ip},%{happened_instant},%{time_slot},%{provider_from},%{provider_to},%{status}"
    }
  }
  # Parse the event time from the log line into failover_date.
  date {
    match  => [ "happened_instant", "dd-MM-yyyy HH:mm:ss" ]
    target => "failover_date"
  }
}
output {
  elasticsearch {
    hosts => "server_domain:9292"
    index => "connection_november"
  }
  stdout {}
}
//

I am using the failover_date field (the date filter's target) as the time field while indexing.

My file has almost 2,744,744 records, but after indexing, checking in Discover shows more than three times as many hits: 8,745,544 documents.

When I checked the duplicate data, I found that each document had been inserted multiple times, with the only difference being @timestamp.

That is, the same document is inserted multiple times at different times.
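As far as I understand, this is because @timestamp defaults to the time Logstash ingests the event; my date filter writes the parsed event time to failover_date and leaves @timestamp alone. A minimal illustration of my own filter, with the default behaviour noted in comments:

//
date {
  match  => [ "happened_instant", "dd-MM-yyyy HH:mm:ss" ]
  # With no explicit target, the parsed time would replace @timestamp.
  # Pointing target at failover_date keeps @timestamp as the ingestion
  # time, so each re-ingested copy of a line gets a fresh @timestamp.
  target => "failover_date"
}
//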

Below is the format of two identical docs with different timestamps, which I found while expanding the docs in the Discover tab.

First doc
//
{
  "_index": "connection_november",
  "_type": "_doc",
  "_id": "aEEdn24BNAK6JMz4dK_F",
  "_version": 1,
  "_score": null,
  "_source": {
    "ip": "10.8.1.90",
    "failover_date": "2019-11-22T18:30:07.000Z",
    "host": "swupdate",
    "provider_from": "Broadband",
    "message": "10.8.1.90,23-11-2019 00:00:07,CURRENT,Broadband,Idea/Vodafone,Failure",
    "@timestamp": "2019-11-24T20:32:17.903Z",
    "@version": "1",
    "status": "Failure",
    "happened_instant": "23-11-2019 00:00:07",
    "path": "/home/logs/connection.csv",
    "time_slot": "CURRENT",
    "provider_to": "Idea/Vodafone"
  },
  "fields": {
    "failover_date": [
      "2019-11-22T18:30:07.000Z"
    ],
    "@timestamp": [
      "2019-11-24T20:32:17.903Z"
    ]
  },
  "sort": [
    1574627537903
  ]
}
//

Second doc
//
{
  "_index": "connection_november",
  "_type": "_doc",
  "_id": "2vH3mW4BNAK6JMz4Btpt",
  "_version": 1,
  "_score": null,
  "_source": {
    "ip": "10.8.1.90",
    "failover_date": "2019-11-22T18:30:07.000Z",
    "host": "swupdate",
    "provider_from": "Broadband",
    "message": "10.8.1.90,23-11-2019 00:00:07,CURRENT,Broadband,Idea/Vodafone,Failure",
    "@timestamp": "2019-11-23T20:32:13.228Z",
    "@version": "1",
    "status": "Failure",
    "happened_instant": "23-11-2019 00:00:07",
    "path": "/home/logs/connection.csv",
    "time_slot": "CURRENT",
    "provider_to": "Idea/Vodafone"
  },
  "fields": {
    "failover_date": [
      "2019-11-22T18:30:07.000Z"
    ],
    "@timestamp": [
      "2019-11-23T20:32:13.228Z"
    ]
  },
  "sort": [
    1574541133228
  ]
}
//

Please note that the two documents are identical apart from the @timestamp.

Can anyone suggest why this issue occurs, or suggest a solution?
Please note that I am new to ES.

How is data added to the file you are reading from?

Hi Christian,

It's being added simultaneously from different edge devices, such as BeagleBones.
It's appended to by almost 250 devices.

Is this a local file being appended to? Does the inode change when it is updated?

Is this a local file being appended to?

Yes. It is a local file on the server where Logstash is running, but it is updated by many remote edge devices over the internet every 5 minutes.

Did you mean whether my Logstash and ES are running on the same server?
No, both are on separate servers.

Does the inode change when updated?

Yes, the inode on the server where Logstash is running will change, because there are many other files that are transferred to this server simultaneously.

The inode count is 67033.

Hi Christian,
Can you help me with this issue?
Are the details provided above regarding the inode sufficient?

Thanks and regards

Does the inode of the file (not inode count) change as it is being updated?

Yes, christian_D.
It changes over time.
The inode changes because the main input file is modified frequently and simultaneously by the edge devices.

If it changes, the file will look like a new file and be reprocessed. It is expected that data is appended and that the inode does not change.

Hi Christian,
Thanks for the help you have rendered so far. I now understand the cause of my issue. As you said, because the inode changes, the Logstash file input treats my input file as a new file every time, re-indexes it, and this ultimately results in duplicate documents.

So can you suggest a solution for handling an input file whose inode is constantly changing?
My input file is a concatenation of multiple small files from edge devices, which are appended on the server all the time.

Actually it uses Linux rsync to append small files from multiple edge devices to the server, so my input file grows every 5 minutes.
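I also wonder whether I could avoid the concatenated file entirely and let Logstash read the small per-device files directly, so each file keeps a stable inode. A rough, untested sketch (the incoming path and completed-log path are my assumptions, not my current setup):

//
input {
  file {
    # Hypothetical layout: read each rsynced per-device file once in
    # "read" mode, instead of tailing the ever-recreated combined file.
    path => "/home/logs/incoming/*.csv"
    mode => "read"
    file_completed_action => "log"
    file_completed_log_path => "/home/ALL/completed_connection_files.log"
    sincedb_path => "/home/ALL/since_connection_november.db"
  }
}
//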

Can you install Filebeat on the edge devices?

It's almost 350 devices...
Is there any other way?
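For reference, another approach that is often suggested for re-read problems like this one (a minimal sketch, not tested against this setup) is to derive the Elasticsearch document ID from the event content with the fingerprint filter, so that a line which is read again overwrites its earlier copy instead of creating a new document:

//
filter {
  fingerprint {
    # Hash the raw CSV line; identical lines always produce the same ID.
    source => ["message"]
    target => "[@metadata][fingerprint]"
    method => "SHA256"
  }
}
output {
  elasticsearch {
    hosts => "server_domain:9292"
    index => "connection_november"
    # Deterministic ID: re-ingesting a line updates the existing
    # document instead of indexing a duplicate.
    document_id => "%{[@metadata][fingerprint]}"
  }
}
//

One caveat: genuinely repeated lines in the source data would also collapse into a single document.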
