I want to remove duplicate events inside a Logstash filter. How could I do that? I have included the events below; please have a look and suggest a solution.

{
             "date" => 2023-12-12T00:00:00.000Z,
         "category" => "AUTH",
         "username" => "cassandra",
       "event_time" => "ab390a7b-98e7-11ee-af20-4b75abbb029d",
             "node" => "172.31.57.239",
      "consistency" => "",
           "source" => "152.58.118.34",
    "keyspace_name" => "",
       "table_name" => "",
       "@timestamp" => 2023-12-12T12:13:06.527Z,
             "type" => "test",
        "operation" => "LOGIN",
            "error" => false,
         "@version" => "1"
}
{
             "date" => 2023-12-12T00:00:00.000Z,
         "category" => "AUTH",
         "username" => "cassandra",
       "event_time" => "aa76515a-98e7-11ee-a2a5-4b76abbb029d",
             "node" => "172.31.57.239",
      "consistency" => "",
           "source" => "152.58.118.34",
    "keyspace_name" => "",
       "table_name" => "",
       "@timestamp" => 2023-12-12T12:13:06.527Z,
             "type" => "test",
        "operation" => "LOGIN",
            "error" => false,
         "@version" => "1"
}

You can use a fingerprint filter with the concatenate_all_fields option set to true. If you are sending events to Elasticsearch, use the fingerprint as the document_id, and duplicate events will be overwritten.
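
Something along these lines, as a minimal sketch (the hosts and index name are placeholders you would replace with your own):

filter {
  fingerprint {
    method => "SHA256"
    concatenate_all_fields => true
    target => "[@metadata][fingerprint]"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]            # placeholder, point this at your cluster
    index => "audit-logs"                         # placeholder index name
    document_id => "%{[@metadata][fingerprint]}"  # same fingerprint => same document
  }
}

Putting the fingerprint in [@metadata] keeps it out of the indexed document while still making it available to the output.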

If you really want to do the de-duplication in Logstash (because you are not writing to Elasticsearch) then you would need to use a ruby filter that builds a cache of recently seen fingerprints. You would look for the fingerprint in the cache and call event.cancel if it is found, or add it to the cache if not. If you have multiple worker threads then you will need to synchronize access to the cache, and you will need to implement a cache purge strategy. Decidedly non-trivial.
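
A rough sketch of what that could look like, assuming the fingerprint filter above has already put the hash into [@metadata][fingerprint]; the cache size and purge window here are arbitrary choices, not recommendations:

ruby {
  init => "
    @seen  = {}          # fingerprint => time last seen
    @mutex = Mutex.new   # the filter instance is shared across worker threads
  "
  code => "
    fp = event.get('[@metadata][fingerprint]')
    @mutex.synchronize do
      if @seen.key?(fp)
        event.cancel               # duplicate, drop it
      else
        @seen[fp] = Time.now
        if @seen.size > 10_000     # crude purge strategy
          cutoff = Time.now - 300  # forget fingerprints older than 5 minutes
          @seen.delete_if { |_, t| t < cutoff }
        end
      end
    end
  "
}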


Will an aggregate filter work here or not? I want to drop duplicate logs based on the "operation" property.

I do not know; I have never tried to do deduplication with an aggregate filter.

Are you sure those are duplicates? The value of event_time is not the same.

Can you provide more context on how those events are created? What is the source?

Yes, I am executing queries from ScyllaDB, and after LOGIN these duplicate LOGIN logs are generated with different times. Sometimes it also duplicates the query audit logs.

How could I remove the duplicates? I have never encountered this type of problem before.

Below are the duplicate logs. I executed the query only once, but it is generating duplicate logs.

 2023-12-14 00:00:00.000000+0000 | 172.31.57.239 | f938263b-9a4c-11ee-b9a0-ac113ba300b3 |      DDL |         ONE | False |    mykeyspace |          CREATE TABLE testing (\n    a int,\n    b int,\n    c int,\n    PRIMARY KEY (a, b, c)\n); |     127.0.0.1 |    testing | cassandra
 2023-12-14 00:00:00.000000+0000 | 172.31.57.239 | f938315d-9a4c-11ee-8634-ac103ba300b3 |      DDL |         ONE | False |    mykeyspace |          CREATE TABLE testing (\n    a int,\n    b int,\n    c int,\n    PRIMARY KEY (a, b, c)\n); |     127.0.0.1 |    testing | cassandra

I am executing queries from ScyllaDB, and after LOGIN these duplicate LOGIN logs are generated with different times.

If the time is different, how could these be duplicates? How do you distinguish duplicate events from two login events at the same time?

In the logs you shared now, the third column is different, so it looks like a completely different event. Is this column part of your original event, or is it added by ScyllaDB? I have never used ScyllaDB, so I'm not sure what it means.

To avoid duplicates you need to use a custom id for your documents; the de-duplication will then be done in Elasticsearch, which will overwrite the document every time a document with the same id is indexed.

You would need to use the fingerprint filter in Logstash to create a unique id, but you cannot use the entire message or all fields, because your messages are not really duplicates as far as Logstash is concerned; there is a field that is different.

So, if you consider those messages to be duplicates, you need to concatenate only some of the fields, like the time, IP address, operation, etc., using the fingerprint filter to create a unique id.
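
For example (the field list below is only a guess based on the events you shared; pick whichever fields define a duplicate in your case):

filter {
  fingerprint {
    method => "SHA256"
    source => ["date", "node", "source", "username", "operation"]
    concatenate_sources => true
    target => "[@metadata][fingerprint]"
  }
}

Then use %{[@metadata][fingerprint]} as the document_id in your elasticsearch output, as shown earlier in the thread.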

This blog post has some examples and you should also check the documentation for the fingerprint filter.
