Hi
We have Logstash running on our worker servers, each with an identical config, so that if one host goes down we are still processing events.
The source data is MySQL; Logstash pushes the latest records to Elasticsearch, e.g.:
input {
  jdbc {
    type => "event_v4"
    clean_run => false
    jdbc_driver_library => "/usr/share/java/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://xx"
    jdbc_user => "xx"
    jdbc_password => "xx"
    jdbc_paging_enabled => true
    jdbc_page_size => 10000
    schedule => "*/5 * * * *"
    statement => "xx"
  }
}
The job then runs every 5 minutes via the schedule option above.
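For context, the output side is just the stock elasticsearch output, roughly like the sketch below (host, index name and the id field are illustrative). Keying document_id on the row's primary key is what makes the duplicate runs harmless, since a second write just overwrites the same document:

output {
  elasticsearch {
    hosts => ["xx"]
    index => "events_v4"
    document_id => "%{id}"
  }
}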
Ideally I want to avoid doubling up the work, i.e. both servers picking up the same records and pushing them up. It doesn't cause issues, but it's a waste of resources. Because we use Ansible to build the servers, it's hard to set the cron template differently on each.
I was thinking that once the Logstash job runs, we could update the MySQL records with a 'lastIndexed' date, then in the jdbc input's statement exclude anything where that date is within the last few minutes, e.g.:
WHERE lastUpdated > :sql_last_value AND lastIndexed < :sql_last_value
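In full, the statement would presumably need to be something like this (table and column names are placeholders; rows that have never been indexed would need a NULL guard, otherwise they'd never be picked up at all):

SELECT *
FROM events
WHERE lastUpdated > :sql_last_value
  AND (lastIndexed IS NULL OR lastIndexed < :sql_last_value)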
Is that possible? How can we update that column in MySQL once the records are indexed? (We currently only read from MySQL and write to Elasticsearch.)
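The closest thing I've found is the community logstash-output-jdbc plugin (installed with bin/logstash-plugin install logstash-output-jdbc; it isn't bundled with Logstash). If I'm reading its docs right, the write-back would be roughly this sketch, where the table, column and id field names are placeholders:

output {
  jdbc {
    driver_jar_path => "/usr/share/java/mysql-connector-java.jar"
    connection_string => "jdbc:mysql://xx"
    # first array element is the SQL with ? placeholders,
    # the remaining elements are event fields bound to them in order
    statement => [ "UPDATE events SET lastIndexed = NOW() WHERE id = ?", "id" ]
  }
}

Though I'm not sure how well that copes with the two nodes racing between the SELECT and the UPDATE, or with the 10000-row pages.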
Or any better way to handle this scenario?