Duplicate records in indexes

SirJune · August 8, 2019, 1:51pm

HI,
I have logstash running connecting/reading records from MySQL DB. I have 7 records in a timeframe (say: 01:00:00 - 01:01:00)
I have this config file:

tracking_column => "datetime"
use_column_value =>true
statement => "SELECT hostname,datetime,count FROM callcountTable;"
schedule => "*/20 * * * *"

output {
elasticsearch {
hosts => ["logstashserver01:9200"]
index => "callcount-logstash-%{+YYYY.MM.dd}"
document_id => "%{platform}%{datetime}"
}
stdout { codec => rubydebug }
}

&
indexpattern= "callcount*"

This morning, I found 14 records. 7 from yesterdays' index and 7 from today.

Is this an expected behavior that same set of records will be created on next days' index?
Is there some config i need such that today's index will only have records where datetime=today?

possible solution/workaround:

I am thinking of removing the date "YYYY.MM.dd" in the index line and just have one index. So far I have 1.8 million records in the DB table.

I'd appreciate feedbacks/tips.

thanks,
Ricardo

Jenni · August 8, 2019, 2:14pm

You set a tracking_column, but didn't use sql_last_value. Therefore you keep processing the same entries every 20 minutes.
And your index name is based on the current timestamp, not the datetime column. So Logstash imported the same data into the same index all day long yesterday, but the document_id was always the same, so there were no duplicates. It just keeps doing the same today with da new index. That's why there are now duplicates.
I don't think the import time matters to you, so you could use sql_last_value to only import the new data (where datetime > :sql_last_value), then copy the datetime value into @timestamp because %{+YYYY.MM.dd} is based on the @timestamp column of the event. (If you want to keep @timestamp as the import time and not overwrite it, we could alternatively use Ruby with the strftime function to create the string for the index name based on datetime.)

SirJune · August 14, 2019, 8:40pm

Thank you Jenni.

So i now have this:

use_column_value =>true
tracking_column_type => "timestamp"
clean_run => true
tracking_column => "datetime"
last_run_metadata_path => "/var/log/test_logstash_jdbc_last_run"
statement => "SELECT datetime,count FROM callcountTable where datetime >= '2019-08-14 00:00:00' order by datetime >:sql_last_value;"

it fetched 1980 records and it matches the info of the last row/record in the MySQL database.
$ cat /var/log/test_logstash_jdbc_last_run
--- 2019-08-14 16:13:38.000000000 -04:00

After X minutes, 4 new records are in the MySQL DB. I ran the Logstash query again and it added the 4 new records, and it updated
the last_run file.
$ cat /var/log/test_logstash_jdbc_last_run
--- 2019-08-14 16:18:43.000000000 -04:00

So far so good. Though i noticed in the stdout, it displayed all 1980++ records again.
So the time it took for logstash query to finish is much more longer. I thought, Logstash will only fetch the "new records" so that
execution time will be like fast. Is this an expected behaviour? myTable has 1.8 million rows if i remove the date range filter.

Jenni · August 18, 2019, 3:49pm

statement => "SELECT datetime,count FROM callcountTable where datetime >= '2019-08-14 00:00:00' order by datetime >:sql_last_value;"

In this statement you sort the results by a boolean value (“Is datetime larger than the tracking value?” – Yes/No) instead of filtering them. So naturally there will be all the old entries again.

Isn't this what you wanted to do?

statement => "SELECT datetime,count FROM callcountTable where datetime >= '2019-08-14 00:00:00' AND datetime >:sql_last_value ORDER BY datetime ASC;"

SirJune · August 20, 2019, 9:00pm

Thanks much Jenni! I think I'm getting the right count of records now. I will post an update after some more testing. BTW, ascencing (ASC) is giving a syntax error when run in logstash. Anyways, the default is ASC. clean_run => true also messed up things on me. I disabled it which defaults to false.

system · September 17, 2019, 9:00pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
[solved] How to work with sql_last_value on table that has multiple rows on same second Logstash	5	2230	May 24, 2018
Logstash produces duplicates Logstash	3	1159	July 6, 2017
Getting duplicate records with logstash JDBC plugin Logstash	4	3385	February 23, 2017
Logstash Sincedb duplicate entries Logstash	2	187	November 29, 2023
Duplicated date in my elastic Logstash	6	320	November 1, 2022

Duplicate records in indexes

Related topics