I've been playing around with logstash and I'm forwarding syslog data to my ES cluster, which works great. Now I want to collect log data that's stored in SQL databases as well, and found something about an SQL plugin for logstash. That all sounds great, but I'm missing one tiny thing in the documentation:
How does logstash/the plugin keep track of what records were already pulled from the DB? I didn't see any parameter that would let me specify a tracking field for instance (like ID, or date_created or whatever)...
Anyone know how that's supposed to work? I just don't want:
- duplicate data in ES
- performance issues because the plugin does some weird "oh, let me check what's already here" lookup for every query
Assuming you're talking about the jdbc input plugin, did you look at the sql_last_start parameter that's described in the documentation?
Yep, that's the one.
And yes, I did look at the documentation and have seen that parameter. But from the documentation it's not clear to me what happens with that timestamp regarding my log data. Does the plugin look for a timestamp in the logs and compare it? If so, does my timestamp in the logs need a specific format?
It's your job to include the sql_last_start parameter in your query, and it assumes that you have a timestamp field that you can use. Logstash makes sure that the parameter is populated with the time the query was last run. In other words, your query could look like this:
SELECT * FROM mytable WHERE timestamp >= :sql_last_start
(Not sure about the exact syntax for named parameters but I think you get the idea.)
For example, the first time you run this, sql_last_start will be 1970-01-01 00:00:00 and you'll get all rows. On the second run sql_last_start will be (for example) 2015-12-03 10:55:00 and the query will return only the rows with a timestamp newer than that.
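To put the pieces together, here's a minimal sketch of what a jdbc input using that query could look like. The driver path, connection string, and credentials are placeholders you'd replace with your own; only the statement line with :sql_last_start is the part being illustrated:

```
input {
  jdbc {
    # Placeholder connection details - adjust for your database
    jdbc_driver_library => "/path/to/jdbc-driver.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/mydb"
    jdbc_user => "logstash"
    jdbc_password => "secret"
    # On each run, only rows newer than the previous run are returned
    statement => "SELECT * FROM mytable WHERE timestamp >= :sql_last_start"
  }
}
```

On the very first run there is no previous value, so :sql_last_start defaults to the epoch (1970-01-01 00:00:00) and the query matches everything; afterwards it advances to the time of the last run.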
Ah, ok! That's exactly what I was missing... perfect. So it works the way I'm used to doing things... great. Many thanks for your help!
Have a question regarding the sql_last_start parameter, in particular when it's updated and flushed to disk.
Data import is configured to run every 15 minutes and record the last-run timestamp to the specified file:
schedule => "*/15 * * * *"
record_last_run => true
last_run_metadata_path => "/var/lib/logstash/.logstash_jdbc_last_run"
The initial import worked fine (logstash has been started as a service), and the logstash service keeps running and pulling updates. However, only the initial import pulled all of the data; further imports seem to pull partial data. I'm not sure how to check the value of :sql_last_start while the service is running, since it seems to be flushed to disk only when the service is stopped.
There were DB connection problems reported in the logstash log file (the errors were flushed to disk with a delay), but it seems that the timestamp of the last run was still updated even though the import wasn't successful.
Just went through the plugin documentation at https://github.com/logstash-plugins/logstash-input-jdbc/blob/master/lib/logstash/inputs/jdbc.rb:
"Upon shutting down, this file will be updated with the current value of sql_last_start. Next time the pipeline starts up, this value will be updated by reading from the file."
How does it work with scheduling since the service keeps running?