Handling duplicates from SQL dumps/JSON and API responses in Logstash


(Mike W) #1

I am just about to start importing SQL dumps and API responses from one of our systems. But I have realised that those dumps and responses will contain largely the same information every time, and also:

  • some of the data might get updated (like tables with user details and last_login_time)
  • some of the data might get removed (user has removed their account)
  • some of the data might be added (new users added).

How do I handle this in ES? sincedb_path is no help at all; it is only useful for streaming data. Even when it did detect that the SQL dump had only one new record, the filters failed, because Logstash applied them to the new data only. Why? Because the dump is in JSON format, and the filter chain runs the split filter first, which obviously doesn't work on just the tiny piece of data that changed.

Any ideas?


(Mark Walkom) #2

Find some unique but static values to stitch together to form an _id, and then use that as the document ID. If an update occurs it will simply update (overwrite) the existing document.
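A minimal sketch of that approach using the fingerprint filter, assuming each event carries user_id and email fields (hypothetical names; substitute whatever stable columns your data actually has):

filter {
  fingerprint {
    # Hash the stable fields together into one deterministic value
    source => ["user_id", "email"]
    concatenate_sources => true
    method => "SHA1"
    target => "[@metadata][generated_id]"
  }
}

output {
  elasticsearch {
    hosts       => ["localhost:9200"]
    index       => "users"
    # Re-importing the same record produces the same _id, so it
    # overwrites the existing document instead of duplicating it
    document_id => "%{[@metadata][generated_id]}"
  }
}

Storing the hash under [@metadata] keeps it out of the indexed document itself.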


#3

Use the scheduler so that it re-runs the query every minute, as shown below.

The input code is fine; in the output, did you specify elasticsearch to get the result?

Try this:

input {
  jdbc {
    jdbc_driver_library => "xxxx\oracle-10g\ojdbc14.jar"
    jdbc_driver_class => "oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "jdbc:oracle:thin:@localhost:1521:DATABASE"
    jdbc_user => "ROMAINROM"
    jdbc_password => "ROMAINROM"
    # TOP is SQL Server syntax; Oracle uses ROWNUM instead
    statement => "SELECT * FROM TABLE WHERE ROWNUM <= 10"
    jdbc_paging_enabled => "true"
    jdbc_page_size => "50000"
    # run the query once every minute
    schedule => "*/1 * * * *"
  }
}

output {
  elasticsearch { codec => json hosts => ["localhost:9200"] index => "index9" }
  stdout { codec => rubydebug }
}
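Note that re-running the full query on a schedule will duplicate every document unless the _id is deterministic, since Elasticsearch auto-generates a new _id for each event by default. A hedged sketch of the output, assuming the table has a primary-key column such as USER_ID (the jdbc input lowercases column names by default, so it arrives as the field user_id):

output {
  elasticsearch {
    hosts       => ["localhost:9200"]
    index       => "index9"
    # Deterministic _id from the primary key: each scheduled run
    # overwrites the existing document instead of adding a copy
    document_id => "%{user_id}"
  }
}

With this in place, updated rows overwrite their documents and new rows are added; deleted rows, however, still require separate handling.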

(Mike W) #5

I am sorry....WHAT?


(system) #6

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.