Keep only last record of each type

yodog · July 6, 2016, 7:17pm

Hello all.

I have to keep 2 types of indices on elasticsearch:

1- full, regular index holding all records sent to it (logstash-YYYY.MM.DD)
2- small, fast index holding only the last log message of each kind (lastaction-YYYY.MM)

I need some help setting up number 2 above.

Currently I'm using the fingerprint filter to generate a SHA1 hash of each event's
message field, and using that hash as the document_id for the elasticsearch output.

filter {
    if ("cloned" in [tags]) {
        uuid {
            add_tag     => [ "lastlogin" ]
            overwrite   => true
            target      => "@uuid"
        }
        fingerprint {
            key     => "lastlogin"
            method  => "SHA1"
        }
    }
}

output {
    if ("lastlogin" in [tags]) {
        elasticsearch {
            document_id         => "%{fingerprint}"
            index               => "lastaction-%{+YYYY.MM}"
            sniffing            => true
            template_overwrite  => true
        }
    }
}

How would you guys do it?

I'm asking because, although I'm not seeing any duplicated records (I think this part is OK),
the index is bigger than the full/regular one.
It should not be.

Also, I expected it to update the timestamp only if a newer record was inserted,
but in fact the @timestamp field is always updated, even when the new date is in the past.

Any ideas?

magnusbaeck · July 8, 2016, 5:50am

How would you guys do it?

That's what I'd do.

I'm asking because, although I'm not seeing any duplicated records (I think this part is OK),
the index is bigger than the full/regular one.
It should not be.

It could be larger than expected unless it's optimized to expunge deleted documents. I don't remember to which extent this takes place automatically; check the ES documentation. Keep in mind that ES treats updates like a delete followed by a new document.

Also, I expected it to update the timestamp only if a newer record was inserted,
but in fact the @timestamp field is always updated, even when the new date is in the past.

I'm afraid I don't understand this part.

yodog · July 8, 2016, 11:31am

Sometimes I have to feed old logs to Elasticsearch.

Let's say that I have the following line indexed:

@timestamp  July 8th 2016, 08:00:20.002
message     open: user edgar-allan^poe@bookwriters.com opened INBOX/Trash

and later I backfill a file from July 6th:

@timestamp  July 6th 2016, 01:00:10.001
message     open: user edgar-allan^poe@bookwriters.com opened INBOX/Trash

I would like to keep the most recent record based on @timestamp, but Logstash just replaces the existing record with whichever one arrives last.

How can I make it evaluate the @timestamp field, updating the record if the incoming event is more recent but discarding it if it is older?

magnusbaeck · July 8, 2016, 11:39am

Logstash will update the document as a whole so it seems very unlikely that it wouldn't update @timestamp too. Have you double-checked that the @timestamp field really has the correct contents when you're backfilling data?

yodog · July 8, 2016, 11:48am

Yes, the content is correct.

I am trying to mimic a SQL trigger that just updates the @timestamp field.
(It should never go back in time.)

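CREATE TRIGGER dbo.trgAfterUpdate ON dbo.TblLastAction
AFTER UPDATE
AS
  -- Touch only the rows that were just updated.
  -- Assumes a key column named id (not shown in the original post).
  UPDATE t
  SET last_changed = GETDATE()
  FROM dbo.TblLastAction t
  JOIN Inserted i ON t.id = i.id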
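
One possible way to get the "never go back in time" behaviour without a trigger is Elasticsearch's external versioning: if each write carries the event time as its version number, Elasticsearch should reject any write whose version is not strictly greater than the one already stored for that document_id. The sketch below is untested and rests on some assumptions: Logstash 5.0 or later (for the event.get/event.set API in the ruby filter), an elasticsearch output plugin release that supports the version and version_type options, and a hypothetical field name, [@metadata][version], chosen here only for illustration.

```
filter {
    if ("lastlogin" in [tags]) {
        # Store the event time as epoch milliseconds in a metadata field.
        # [@metadata][version] is a hypothetical name; it is not indexed.
        ruby {
            code => "event.set('[@metadata][version]', (event.get('@timestamp').to_f * 1000).round)"
        }
    }
}

output {
    if ("lastlogin" in [tags]) {
        elasticsearch {
            document_id   => "%{fingerprint}"
            index         => "lastaction-%{+YYYY.MM}"
            # Use the event time as the document version, so a backfilled
            # (older) event cannot overwrite a newer one with the same id.
            version       => "%{[@metadata][version]}"
            version_type  => "external"
        }
    }
}
```

With this in place, the backfilled July 6th event would be expected to fail with a version conflict (HTTP 409) instead of replacing the July 8th document. Note that the check applies per index: because the index name is derived from @timestamp, events from different months land in different lastaction-YYYY.MM indices and are not compared with each other.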
