I want to remove duplicate events inside a Logstash filter. How could I do that? I have included the events below; please have a look and suggest a solution.

{
             "date" => 2023-12-12T00:00:00.000Z,
         "category" => "AUTH",
         "username" => "cassandra",
       "event_time" => "ab390a7b-98e7-11ee-af20-4b75abbb029d",
             "node" => "172.31.57.239",
      "consistency" => "",
           "source" => "152.58.118.34",
    "keyspace_name" => "",
       "table_name" => "",
       "@timestamp" => 2023-12-12T12:13:06.527Z,
             "type" => "test",
        "operation" => "LOGIN",
            "error" => false,
         "@version" => "1"
}
{
             "date" => 2023-12-12T00:00:00.000Z,
         "category" => "AUTH",
         "username" => "cassandra",
       "event_time" => "aa76515a-98e7-11ee-a2a5-4b76abbb029d",
             "node" => "172.31.57.239",
      "consistency" => "",
           "source" => "152.58.118.34",
    "keyspace_name" => "",
       "table_name" => "",
       "@timestamp" => 2023-12-12T12:13:06.527Z,
             "type" => "test",
        "operation" => "LOGIN",
            "error" => false,
         "@version" => "1"
}

You can use a fingerprint filter with the concatenate_all_fields option set to true. If you are sending events to Elasticsearch, use the fingerprint as the document_id, and duplicate events will be overwritten.
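
Something along these lines, as a minimal sketch (the hosts and index name are placeholders you would replace with your own):

filter {
  fingerprint {
    method => "SHA256"
    concatenate_all_fields => true
    target => "[@metadata][fingerprint]"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]            # placeholder, point this at your cluster
    index => "audit-logs"                         # placeholder index name
    document_id => "%{[@metadata][fingerprint]}"  # same fingerprint => same document
  }
}

Putting the fingerprint in [@metadata] keeps it out of the indexed document while still making it available to the output.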

If you really want to do the de-duplication in Logstash (because you are not writing to Elasticsearch) then you would need to use a ruby filter that builds a cache of recently seen fingerprints. You would look for the fingerprint in the cache and call event.cancel if it is found, or add it to the cache if not. If you have multiple worker threads then you will need to synchronize access to the cache, and you will need to implement a cache purge strategy. Decidedly non-trivial.
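
A rough sketch of what that could look like, assuming the fingerprint filter above has already put the hash into [@metadata][fingerprint]; the cache size and purge window here are arbitrary choices, not recommendations:

ruby {
  init => "
    @seen  = {}          # fingerprint => time last seen
    @mutex = Mutex.new   # the filter instance is shared across worker threads
  "
  code => "
    fp = event.get('[@metadata][fingerprint]')
    @mutex.synchronize do
      if @seen.key?(fp)
        event.cancel               # duplicate, drop it
      else
        @seen[fp] = Time.now
        if @seen.size > 10_000     # crude purge strategy
          cutoff = Time.now - 300  # forget fingerprints older than 5 minutes
          @seen.delete_if { |_, t| t < cutoff }
        end
      end
    end
  "
}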


Will an aggregate filter work here or not? I want to drop duplicate logs based on the "operation" property.

I do not know; I have never tried to do deduplication with an aggregate filter.

Are you sure those are duplicates? The value of event_time is not the same.

Can you provide more context on how those events are created? What is the source?

Yes, I am executing queries from ScyllaDB, and after LOGIN these duplicate LOGIN logs are generated with different times. Sometimes it also duplicates the query audit logs.

How could I remove the duplicates? I have never encountered this type of problem before.

Below are the duplicate logs. I executed the query only once, but it is generating duplicate logs.

 2023-12-14 00:00:00.000000+0000 | 172.31.57.239 | f938263b-9a4c-11ee-b9a0-ac113ba300b3 |      DDL |         ONE | False |    mykeyspace |          CREATE TABLE testing (\n    a int,\n    b int,\n    c int,\n    PRIMARY KEY (a, b, c)\n); |     127.0.0.1 |    testing | cassandra
 2023-12-14 00:00:00.000000+0000 | 172.31.57.239 | f938315d-9a4c-11ee-8634-ac103ba300b3 |      DDL |         ONE | False |    mykeyspace |          CREATE TABLE testing (\n    a int,\n    b int,\n    c int,\n    PRIMARY KEY (a, b, c)\n); |     127.0.0.1 |    testing | cassandra

I am executing queries from ScyllaDB, and after LOGIN these duplicate LOGIN logs are generated with different times.

If the time is different, how could these be duplicates? How do you distinguish duplicate events from two login events at the same time?

In the logs you shared now, the third column is different, so it looks like a completely different event. Is this column part of your original event, or is it added by ScyllaDB? I have never used ScyllaDB, so I'm not sure what it means.

To avoid duplicates you need to use a custom id for your documents; the de-duplication will then be done in Elasticsearch, which will overwrite the document every time a document with the same id is indexed.

You would need to use the fingerprint filter in Logstash to create a unique id, but you cannot use the entire message or all fields, because your messages are not really duplicates as far as Logstash is concerned; there is a field that is different.

So, if you consider those messages to be duplicates, you need to concatenate only some of the fields, like the time, IP address, operation, etc., using the fingerprint filter to create a unique id.
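
For example (the field list below is only a guess based on the events you shared; pick whichever fields define a duplicate in your case):

filter {
  fingerprint {
    method => "SHA256"
    source => ["date", "node", "source", "username", "operation"]
    concatenate_sources => true
    target => "[@metadata][fingerprint]"
  }
}

Then use %{[@metadata][fingerprint]} as the document_id in your elasticsearch output, as shown earlier in the thread.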

This blog post has some examples and you should also check the documentation for the fingerprint filter.
