Duplicate records in indexes

Hi,
I have Logstash running, connecting to and reading records from a MySQL DB. I have 7 records in a timeframe (say 01:00:00 - 01:01:00).
I have this config file:

tracking_column => "datetime"
use_column_value => true
statement => "SELECT hostname,datetime,count FROM callcountTable;"
schedule => "*/20 * * * *"

output {
  elasticsearch {
    hosts => ["logstashserver01:9200"]
    index => "callcount-logstash-%{+YYYY.MM.dd}"
    document_id => "%{platform}%{datetime}"
  }
  stdout { codec => rubydebug }
}

and the index pattern is "callcount*".

This morning, I found 14 records: 7 from yesterday's index and 7 from today's.

  1. Is it expected behavior that the same set of records gets created in the next day's index?
  2. Is there some config I need so that today's index only has records where datetime = today?

Possible solution/workaround:

  1. I am thinking of removing the date "YYYY.MM.dd" from the index line and just having one index. So far I have 1.8 million records in the DB table.

I'd appreciate feedback/tips.

thanks,
Ricardo

You set a tracking_column, but didn't use sql_last_value, so you keep processing the same entries every 20 minutes.
And your index name is based on the current timestamp, not the datetime column. So yesterday Logstash imported the same data into the same index all day long, but the document_id was always the same, so there were no duplicates. Today it keeps doing the same thing with a new index, which is why you now see duplicates.
I don't think the import time matters to you, so you could use sql_last_value to only import the new data (WHERE datetime > :sql_last_value) and then copy the datetime value into @timestamp, because %{+YYYY.MM.dd} is based on the @timestamp field of the event. (If you want to keep @timestamp as the import time and not overwrite it, you could alternatively use a Ruby filter with strftime to build the index name string from datetime.)
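
A minimal sketch of that first approach (assuming the datetime column comes back from the JDBC input as a string in yyyy-MM-dd HH:mm:ss format; if it already comes back as a timestamp object, a mutate copy into @timestamp would do the same job; connection settings omitted):

input {
  jdbc {
    # ... jdbc_connection_string, jdbc_user, jdbc_password, driver settings as before ...
    use_column_value => true
    tracking_column => "datetime"
    tracking_column_type => "timestamp"
    # only fetch rows newer than the last stored tracking value
    statement => "SELECT hostname,datetime,count FROM callcountTable WHERE datetime > :sql_last_value ORDER BY datetime"
    schedule => "*/20 * * * *"
  }
}

filter {
  # copy the record's datetime into @timestamp so that %{+YYYY.MM.dd}
  # in the index name reflects the data's date, not the import time
  date {
    match => ["datetime", "yyyy-MM-dd HH:mm:ss", "ISO8601"]
    target => "@timestamp"
  }
}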

Thank you Jenni.

So I now have this:

use_column_value => true
tracking_column_type => "timestamp"
clean_run => true
tracking_column => "datetime"
last_run_metadata_path => "/var/log/test_logstash_jdbc_last_run"
statement => "SELECT datetime,count FROM callcountTable where datetime >= '2019-08-14 00:00:00' order by datetime >:sql_last_value;"

It fetched 1980 records, and the stored tracking value matches the last row/record in the MySQL database.
$ cat /var/log/test_logstash_jdbc_last_run
--- 2019-08-14 16:13:38.000000000 -04:00

After X minutes, 4 new records were in the MySQL DB. I ran the Logstash query again; it added the 4 new records and updated the last_run file.
$ cat /var/log/test_logstash_jdbc_last_run
--- 2019-08-14 16:18:43.000000000 -04:00

So far so good. However, I noticed in the stdout that it displayed all 1980+ records again, so the Logstash query took much longer to finish. I thought Logstash would only fetch the new records, so that execution would be fast. Is this expected behaviour? The table has 1.8 million rows if I remove the date range filter.

statement => "SELECT datetime,count FROM callcountTable where datetime >= '2019-08-14 00:00:00' order by datetime >:sql_last_value;"

In this statement you sort the results by a boolean value ("Is datetime larger than the tracking value?" – yes/no) instead of filtering by it. So naturally all the old entries show up again.

Isn't this what you wanted to do?

statement => "SELECT datetime,count FROM callcountTable where datetime >= '2019-08-14 00:00:00' AND datetime >:sql_last_value ORDER BY datetime ASC;"

Thanks much, Jenni! I think I'm getting the right count of records now. I will post an update after some more testing. BTW, ascending (ASC) gives a syntax error when run in Logstash; anyway, the default order is ascending. clean_run => true also messed things up for me, so I disabled it (it defaults to false).
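
So the input section now looks roughly like this (a sketch based on the snippets above; connection settings omitted, clean_run left at its default of false, and no explicit ASC since ascending is the default):

input {
  jdbc {
    # ... connection settings as before ...
    use_column_value => true
    tracking_column => "datetime"
    tracking_column_type => "timestamp"
    last_run_metadata_path => "/var/log/test_logstash_jdbc_last_run"
    # filter on :sql_last_value instead of sorting by it, and order ascending
    # so the stored tracking value is the newest datetime fetched
    statement => "SELECT datetime,count FROM callcountTable WHERE datetime >= '2019-08-14 00:00:00' AND datetime > :sql_last_value ORDER BY datetime"
    schedule => "*/20 * * * *"
  }
}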
