JDBC tracking a string column for unique ingestion

I'm trying to ingest only new rows from our database (existing rows never change), but I'm having trouble keeping track of what has already been ingested, which causes duplicate data when sent to ES.

I'm using the JDBC input plugin to query a DB2 database, and my table does not have a unique integer column. The timestamp column is stored as a string, which sometimes contains spaces.

Below is a sample of my input configuration (some details removed):

input {
    jdbc {
        jdbc_connection_string => "jdbc:as400://0.0.0.0/database"
        jdbc_user              => "username"
        jdbc_password_filepath => "/etc/logstash/database.sec"
        jdbc_driver_library    => "/etc/logstash/jt400.jar"
        jdbc_driver_class      => "com.ibm.as400.access.AS400JDBCDriver"
        tracking_column        => "TRANSACTION_TIMESTAMP"
        statement => "SELECT TRANSACTION_TIMESTAMP,
                             FIELD1,
                             FIELD2,
                             FIELD3 FROM TRANSACTIONS WHERE TRANSACTION_TIMESTAMP LIKE '201709%' AND
                             TRANSACTION_TIMESTAMP <> :sql_last_value
                             ORDER BY TRANSACTION_TIMESTAMP"
    }
}

Does anybody have any idea what is going on here? Another question: how can I see the plugin's debug log? I've had no luck in /var/log/logstash or with the --debug flag on the command line.

There is a timestamp in the .logstash_jdbc_last_run file, but no indication of the field I'm tracking.
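
For reference, the file holds a single YAML value; mine currently contains something like this (illustrative timestamp, not my real data):

--- 2017-11-14 08:31:51.507000000 Z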

Regards

This looks like a not-equal-to comparison. Shouldn't this be strictly greater than in order to work?

Hi Christian, thank you for your reply. I thought the same when I started testing, but either way it doesn't seem to work properly. Is there a way to output what Logstash holds in its "sql_last_value"?

Here is a piece of my dataset as it comes from the database:

TRANSACTION_TIMESTAMP
20171031235935785
20171031235943328
20171031235945245

Why do you have this in there? I would expect all data matching that to be read during the first pass, meaning there would never be any new matches, assuming data is entered in order of TRANSACTION_TIMESTAMP.

This is because I'm querying a system with 2 years of transactions and I'm testing with a small amount of data until I get it right.

And you are right: the system should ingest all the September transactions in the first run and then honor "sql_last_value" on subsequent runs, not ingesting the whole September dataset again. That is the challenge at the moment: if I run it multiple times, duplicate data is sent to ES on every execution.

You need to change the condition to strictly greater than and make sure that you sort in ascending order. If you keep <>, I would expect almost all of the September data (everything except the last timestamp) to be reloaded.

Christian, thank you again for your support.

I made some changes, but still no luck; the data is still being duplicated.
First I tried the approach you suggested, using greater than and ordering ASC:

statement => "SELECT TRANSACTION_TIMESTAMP,
                             FIELD1,
                             FIELD2,
                             FIELD3 FROM TRANSACTIONS WHERE TRANSACTION_TIMESTAMP LIKE '201709%' AND
                             TRANSACTION_TIMESTAMP > :sql_last_value
                             ORDER BY TRANSACTION_TIMESTAMP ASC"

Then I experimented a bit, removing the ORDER BY altogether and changing the tracking column to lower case (just in case), combined with all the suggestions, but nothing seems to work. How can I see more verbose output from this operation?

The best way is probably to enable debug logging.
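
For example, something like this (assuming Logstash 5.x or later, where the old --debug flag was replaced by --log.level; adjust the config path to your setup):

bin/logstash --log.level=debug -f /etc/logstash/conf.d/pipeline.conf

Or persistently, in /etc/logstash/logstash.yml:

log.level: debug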

After a break I came up with a working version of the code, and everything is being tracked correctly now. Tailing the .logstash_jdbc_last_run file made it easier to see what was being saved there.
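
In case anyone else goes looking for it: by default the file sits in the home directory of the user running Logstash (the location is configurable via the plugin's last_run_metadata_path option), so watching it is as simple as:

tail -f ~/.logstash_jdbc_last_run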

It seems that the tracking_column value must be lower case in my environment, and the use_column_value option was missing. On top of that, the ">" operator was represented incorrectly in the original code ("<>" instead of ">"), and lastly the SQL statement needed to sort in ascending order.
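
With use_column_value enabled, the file stores the tracked column's value instead of a run timestamp. In my case it now holds the string timestamp of the last row read, something like this (illustrative value):

--- '20170930235959999'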

For those interested, here is the final working version. Thank you very much for the help, Christian.

input {
    jdbc {
        jdbc_connection_string => "jdbc:as400://0.0.0.0/database"
        jdbc_user              => "username"
        jdbc_password_filepath => "/etc/logstash/database.sec"
        jdbc_driver_library    => "/etc/logstash/jt400.jar"
        jdbc_driver_class      => "com.ibm.as400.access.AS400JDBCDriver"
        tracking_column        => "transaction_timestamp"   # cAsE mAtteRs!
        use_column_value       => true   # track the column value instead of the last run time.
        statement => "SELECT TRANSACTION_TIMESTAMP,
                             FIELD1,
                             FIELD2,
                             FIELD3 FROM TRANSACTIONS WHERE TRANSACTION_TIMESTAMP LIKE '201709%'
                             AND TRANSACTION_TIMESTAMP > :sql_last_value
                             ORDER BY TRANSACTION_TIMESTAMP ASC"
    }
}
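
One last note for anyone testing: if you need to re-ingest everything from scratch, the plugin's clean_run option resets the stored sql_last_value (deleting the .logstash_jdbc_last_run file works too). A minimal sketch, to be placed inside the jdbc block:

clean_run => true   # resets sql_last_value so the next run starts over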
