[solved] How to work with sql_last_value on table that has multiple rows on same second

skatemple · April 26, 2018, 7:01pm

Hello,
I'm having troubles with a table, it has not a unique id column, so I have to pull data using a datetime column in my "where" statement.

Problem is, that this tables saves the Datetime like this: "2018-04-26 02:36:49.000", so there are no miliseconds, and it makes it to have a LOT of records with the exact same datetime:

This is a trouble, because on the next execution, the query ignores some results because of this part of the statement: "AND L.DateTime > :sql_last_value"

I've already tried changing the ">" to ">=" but then I have duplicate records.

To avoid duplicate records I've tried having this config (with no luck):

    if [env] == "sc" {
    grok {match => { "datetime" => "%{YEAR:year}-%{MONTHNUM:month}-%{MONTHDAY:day}"}}
    #grok {match => { "datetime" => "%{TIMESTAMP_ISO8601:date}"}}
    mutate {
        convert => { "datetime" => "string" }
    }
    date {
        match => [ "datetime" , "ISO8601"]
        timezone => "America/New_York"
        target => "@timestamp"
    }
    fingerprint {
        source => ["guid", "datetime", "Msisdn"]
        target => "[@metadata][fingerprint]"
        method => "MURMUR3"
    }
    mutate {
        replace => {
            "[type]" => "sc-%{year}%{month}%{day}"
            "[document_type]" => "data"
            "[document_id]" => "%{[@metadata][fingerprint]}"
        }
        remove_field => [ "year", "month", "day", "[fields][env]", "tags", "date","document_type"]
    }
}

What else can I do to get ALL records correctly?

Jenni · April 26, 2018, 8:29pm

Did you specify that document_id as the document_id in your ES output configuration? Otherwise it probably won't work because your fingerprint is not used as the ES _id

(2. The problem are entries that are added in the same second. So if a delay is acceptable, you could adjust your query to only get entries that are at least one second old... Not pretty, but in theory, that would be a solution...)

yaauie · April 26, 2018, 9:05pm

I believe that your problem can be solved by avoiding retrieving partial seconds, and that understanding the nature of windowing will help you avoid missing other events, too.

I make a few assumptions below:

first, that you are retrieving from a table that is constantly being appended to
second, that the data in this table is inserted loosely in order with some known windowing guarantee (e.g., with a one-minute windowing guarantee, once a row with a timestamp of 2018-04-26 02:36:49.000 is inserted into the database, we're reasonably confident that the database will never receive a record one minute older than that, or if it does, we're okay with missing it)

Once the JDBC Input Plugin sees a value for a timestamp, subsequent invocations will only retrieve data that is newer than that timestamp. If data is not inserted exactly in-order (say you have multiple systems inputing data, and one of them lags by a few seconds), we run the risk of the "fast" system telling us to ignore messages that a "slow" system is sending; imagine the following sequence of events:

T+1: fast system A inserts a record a that is 1 second old with timestamp T+0
T+2: JDBC input plugin runs, the newest record it retrieves a from timestamp T+0
T+3: slow system B inserts a record b that is 14-seconds old with timestamp T-17
T+4: JDBC input plugin runs, fetching only data newer than T+0 (note: record b is skipped)

If it takes a few seconds or minutes for our messages to stabilise, we don't want records from our "fast" systems to be telling us to ignore records from our "slow" systems.

To avoid this, we can add a clause to our SQL statement to ensure that we don't get data from this stabilisation window. Each SQL implementation varies a bit in how they implement date functions, but in T-SQL, we could do the following to ensure we don't get data that is from the last 30 seconds:

WHERE L.DateTime < DATEADD(SECOND, -30, GETDATE())

skatemple · April 26, 2018, 11:26pm

Thanks Ry!
That's hell of a great explanation on this workaround, in fact, I believe it's pretty good, I never thought of that solution.

Altough I didn't need to test it because I found out that I was using a field that didn't exist on the finerprint creation, maybe that's why it didn't work and was duplicating documents.

After fixing that fingerprint thing, and after I changed the statement to " WHERE L.DateTime >= :sql_last_value", now everything's ok, and I don't see any duplicate or missing documents.

Thanks anyways! I would for sure use that solution at some point in the future.

skatemple · April 26, 2018, 11:27pm

Hello Jenni,
Yes I did specified the document_id with the metadata fingerprint, my mistake (I believe) was to use a field that didn't exist, and maybe that's why it was failing?

I'm not sure but I only changed that part, and now I don't see any duplicate entry on my index.

Thank you so much

system · May 24, 2018, 11:28pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Sql_last_value not working properly Logstash	3	1743	May 11, 2020
Getting some documents ahead of my current time that alter the sql_last_value Logstash	7	489	November 4, 2022
JDBC Input last value overlap Logstash	5	409	July 14, 2021
How should I use sql_last_value in logstash? Logstash	11	27141	July 6, 2017
Jdbc sql_last_value Logstash	2	1783	January 15, 2017

[solved] How to work with sql_last_value on table that has multiple rows on same second

Related topics