Use logstash_checksum as tracking_column

Hi all

Logstash generates a checksum from my row data perfectly:

  {
    "_index": "rel",
    "_type": "_doc",
    "_id": "2",
    "_version": 5,
    "_score": 0,
    "_source": {
      "datecreated": "2016-06-09T00:00:00.000Z",
      "datemodified": "2019-11-20T00:00:00.000Z",
      "modifiedby": null,
      "@timestamp": "2019-11-25T09:04:02.730Z",
      "rel_id": 4,
      "ts_id": null,
      "md_id": null,
      "createdby": "SYSTEM",
      "ng": false,
      "logstash_checksum": "5694ec39969161d4b74d716001f81c0b",
      "id": 2,
      "isactive": true,
      "@version": "1"
    }
  }
But I get an error when I try to use this checksum as the tracking column for updated rows:

[main] Exception when executing JDBC query {:exception=>#<Sequel::DatabaseError: Java::ComMicrosoftSqlserverJdbc::SQLServerException: Incorrect syntax near '}'.>}

This is my conf file:

  input {
    jdbc {
      # Query for testing purposes
      statement => "SELECT * from gln where {logstash_checksum} > :sql_last_value"
      tracking_column => "{logstash_checksum}"
      #tracking_column_type => "binary"
      use_column_value => true
      #clean_run => true
      schedule => "* * * * *"
    }
  }
  filter {
    checksum {
      algorithm => 'md5'
    }
    # fingerprint{}
  }
  output {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "rel"
      document_id => "%{id}"
    }
    stdout { codec => rubydebug }
  }

Any idea what I'm doing wrong?

Thanks in advance.

Does the logstash_checksum column exist in the database? If so, remove the {}. If not, you cannot reference it in the input. I would expect it to get added by the checksum filter.

There is no reason to use md5. It is broken and has been for nearly two decades. Continuing to use it to detect accidental duplication just encourages folks to continue to use it for cryptographic purposes for which it is completely unfit. Why not use the default algorithm?

Hi Badger

The logstash_checksum column does not exist in my table. I just want a way to import my SQL data into Elasticsearch and to check whether a row has been updated or not, so that not all my data has to be pushed to Elasticsearch every time. Are there examples of how to manage that?

As for the algorithm, I was just trying something.

Thanks for the response.

If there is no tracking column in the database then you are going to have to fetch all of the data repeatedly and detect duplicates. You could do that with a checksum filter (or in any version of logstash from the last few years, a fingerprint filter) and then set the document_id on the elasticsearch output to the checksum.
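A minimal sketch of that approach, using the fingerprint filter. The connection settings, index name, and the list of columns to hash are placeholders; hash the row columns explicitly rather than the whole event, since fields like @timestamp change on every run and would defeat the deduplication:

  input {
    jdbc {
      # Placeholder connection settings
      jdbc_connection_string => "jdbc:sqlserver://localhost:1433;databaseName=mydb"
      jdbc_user => "user"
      statement => "SELECT * FROM gln"
      schedule => "* * * * *"
    }
  }
  filter {
    # Hash the row columns so an unchanged row always produces the same id
    fingerprint {
      method => "SHA256"
      source => ["id", "datecreated", "datemodified", "isactive"]  # adjust to your schema
      concatenate_sources => true
      target => "logstash_checksum"
    }
  }
  output {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "rel"
      # An unchanged row re-indexes onto the same document instead of duplicating it
      document_id => "%{logstash_checksum}"
    }
  }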

Isn't there a huge load on your DB doing it this way? What if my tables have millions of rows and they all have to be fetched just to check whether there is a change or not?

Or am I missing something?

Yes. But if there is no tracking column in the database there is no alternative. If you have a column that is a timestamp or sequence that tracks updates then use it, and you will only fetch updated rows.
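For example, assuming the table has a datemodified column that is bumped on every update (the connection settings here are placeholders), the jdbc input can fetch only the changed rows:

  input {
    jdbc {
      # Placeholder connection settings
      jdbc_connection_string => "jdbc:sqlserver://localhost:1433;databaseName=mydb"
      jdbc_user => "user"
      # Only rows modified since the last run are fetched;
      # :sql_last_value is the tracking column's value from the previous run
      statement => "SELECT * FROM gln WHERE datemodified > :sql_last_value"
      use_column_value => true
      tracking_column => "datemodified"
      tracking_column_type => "timestamp"
      schedule => "* * * * *"
    }
  }

Combined with document_id => "%{id}" on the elasticsearch output, each updated row then overwrites its earlier copy in the index.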

I have a column in my table that is a timestamp in binary, but Logstash imports it like this:

<Sequel::SQL::Blob:0x9b18 bytes=8 content="\u0000\u0000\u0000\u0000\u0001\xE6\u0016\xEF"
instead of a usable timestamp value.

Any idea why this happens?

No. You should create a new question about that.

I did

Anyway, thanks for the patience and the help.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.