Duplicate documents found in ES after indexing

Hi friends,
I am facing a serious issue: duplicate documents are appearing in ES after indexing...

Log format:
//
10.8.1.90,23-11-2019 00:00:07,CURRENT,Broadband,Idea/Vodafone,Failure
10.8.1.90,23-11-2019 01:30:08,CURRENT,Broadband,Idea/Vodafone,Failure
10.8.1.80,23-11-2019 00:00:09,CURRENT,Broadband,Idea/Vodafone,Success
10.8.1.80,23-11-2019 01:30:08,CURRENT,Idea/Vodafone,Broadband,Success
10.8.1.74,23-11-2019 10:06:02,CURRENT,Idea/Vodafone,Broadband,Failure
10.8.1.74,23-11-2019 10:07:02,CURRENT,Idea/Vodafone,Broadband,Failure
10.8.1.74,23-11-2019 10:08:02,CURRENT,Idea/Vodafone,Broadband,Failure
//

I am using the following Logstash configuration file for indexing:

//
input {
  file {
    path => "/home/logs/connection.csv"
    max_open_files => 17000
    start_position => "beginning"
    sincedb_path => "/home/ALL/since_connection_november.db"
  }
}
filter {
  # Sample lines:
  # 10.8.0.68,10-11-2019 12:00:12,CURRENT,Broadband,Idea/Vodafone,Success
  # 10.8.0.68,10-11-2019 13:30:14,CURRENT,Idea/Vodafone,Broadband,Failure
  dissect {
    mapping => {
      "message" => "%{ip},%{happened_instant},%{time_slot},%{provider_from},%{provider_to},%{status}"
    }
  }
  # Parse the event time from the log line into failover_date.
  date {
    match  => [ "happened_instant", "dd-MM-yyyy HH:mm:ss" ]
    target => "failover_date"
  }
}
output {
  elasticsearch {
    hosts => "server_domain:9292"
    index => "connection_november"
  }
  stdout {}
}
//

I am using the failover_date field (the date filter's target) as the time field while indexing.

My file has almost 2,744,744 records, but after indexing, checking in Discover shows more than three times as many hits: 8,745,544 documents.

When I checked the duplicate data, I found that each document had been inserted multiple times, with the only difference being @timestamp.

That is, the same document is inserted multiple times at different times.
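As far as I understand, this is because @timestamp defaults to the time Logstash ingests the event; my date filter writes the parsed event time to failover_date and leaves @timestamp alone. A minimal illustration of my own filter, with the default behaviour noted in comments:

//
date {
  match  => [ "happened_instant", "dd-MM-yyyy HH:mm:ss" ]
  # With no explicit target, the parsed time would replace @timestamp.
  # Pointing target at failover_date keeps @timestamp as the ingestion
  # time, so each re-ingested copy of a line gets a fresh @timestamp.
  target => "failover_date"
}
//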

Below is the format of two identical docs with different timestamps, which I found while expanding the docs in the Discover tab.

First doc
//
{
  "_index": "connection_november",
  "_type": "_doc",
  "_id": "aEEdn24BNAK6JMz4dK_F",
  "_version": 1,
  "_score": null,
  "_source": {
    "ip": "10.8.1.90",
    "failover_date": "2019-11-22T18:30:07.000Z",
    "host": "swupdate",
    "provider_from": "Broadband",
    "message": "10.8.1.90,23-11-2019 00:00:07,CURRENT,Broadband,Idea/Vodafone,Failure",
    "@timestamp": "2019-11-24T20:32:17.903Z",
    "@version": "1",
    "status": "Failure",
    "happened_instant": "23-11-2019 00:00:07",
    "path": "/home/logs/connection.csv",
    "time_slot": "CURRENT",
    "provider_to": "Idea/Vodafone"
  },
  "fields": {
    "failover_date": [
      "2019-11-22T18:30:07.000Z"
    ],
    "@timestamp": [
      "2019-11-24T20:32:17.903Z"
    ]
  },
  "sort": [
    1574627537903
  ]
}
//

Second doc
//
{
  "_index": "connection_november",
  "_type": "_doc",
  "_id": "2vH3mW4BNAK6JMz4Btpt",
  "_version": 1,
  "_score": null,
  "_source": {
    "ip": "10.8.1.90",
    "failover_date": "2019-11-22T18:30:07.000Z",
    "host": "swupdate",
    "provider_from": "Broadband",
    "message": "10.8.1.90,23-11-2019 00:00:07,CURRENT,Broadband,Idea/Vodafone,Failure",
    "@timestamp": "2019-11-23T20:32:13.228Z",
    "@version": "1",
    "status": "Failure",
    "happened_instant": "23-11-2019 00:00:07",
    "path": "/home/logs/connection.csv",
    "time_slot": "CURRENT",
    "provider_to": "Idea/Vodafone"
  },
  "fields": {
    "failover_date": [
      "2019-11-22T18:30:07.000Z"
    ],
    "@timestamp": [
      "2019-11-23T20:32:13.228Z"
    ]
  },
  "sort": [
    1574541133228
  ]
}
//

Please note that the two documents are identical apart from the @timestamp.

Can anyone suggest why this issue occurs, or suggest a solution?
Please note that I am new to ES.

How is data added to the file you are reading from?

Hi Christian,

It's being added simultaneously from different edge devices, such as BeagleBones.
It's appended to by almost 250 devices.

Is this a local file being appended to? Does the inode change when it is updated?

Is this a local file being appended to?

Yes. It is a local file on the server where Logstash is running, but it is updated by many remote edge devices over the internet every 5 minutes.

Did you mean whether my Logstash and ES are running on the same server?
No, both are on separate servers.

Does the inode change when updated?

Yes, the inode on the server where Logstash is running will change, because there are many other files that are transferred to this server simultaneously.

The inode count is 67033.

Hi Christian,
Can you help me with this issue?
Are the details provided above regarding the inode sufficient?

Thanks and regards

Does the inode of the file (not inode count) change as it is being updated?

Yes, christian_D.
It changes over time.
The inode changes because the main input file is modified frequently and simultaneously by the edge devices.

If it changes, the file will look like a new file and be reprocessed. It is expected that data is appended and that the inode does not change.

Hi Christian,
Thanks for the help you have rendered so far. I now understand the cause of my issue. As you said, because the inode changes, the Logstash file input treats my input file as a new file every time, re-indexes it, and this ultimately results in duplicate documents.

So can you suggest a solution for handling an input file whose inode is constantly changing?
My input file is a concatenation of multiple small files from edge devices, which are appended on the server all the time.

Actually it uses Linux rsync to append small files from multiple edge devices to the server, so my input file grows every 5 minutes.
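I also wonder whether I could avoid the concatenated file entirely and let Logstash read the small per-device files directly, so each file keeps a stable inode. A rough, untested sketch (the incoming path and completed-log path are my assumptions, not my current setup):

//
input {
  file {
    # Hypothetical layout: read each rsynced per-device file once in
    # "read" mode, instead of tailing the ever-recreated combined file.
    path => "/home/logs/incoming/*.csv"
    mode => "read"
    file_completed_action => "log"
    file_completed_log_path => "/home/ALL/completed_connection_files.log"
    sincedb_path => "/home/ALL/since_connection_november.db"
  }
}
//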

Can you install Filebeat on the edge devices?

It's almost 350 devices...
Is there any other way?
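For reference, another approach that is often suggested for re-read problems like this one (a minimal sketch, not tested against this setup) is to derive the Elasticsearch document ID from the event content with the fingerprint filter, so that a line which is read again overwrites its earlier copy instead of creating a new document:

//
filter {
  fingerprint {
    # Hash the raw CSV line; identical lines always produce the same ID.
    source => ["message"]
    target => "[@metadata][fingerprint]"
    method => "SHA256"
  }
}
output {
  elasticsearch {
    hosts => "server_domain:9292"
    index => "connection_november"
    # Deterministic ID: re-ingesting a line updates the existing
    # document instead of indexing a duplicate.
    document_id => "%{[@metadata][fingerprint]}"
  }
}
//

One caveat: genuinely repeated lines in the source data would also collapse into a single document.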
