Logstash: filter unique key documents

Hello,

We have an index which has multiple documents with the same phone number.

Each document will always contain the phone number and may contain additional information (see example below)

The documents are written into the index constantly.

For a certain time window, for example 1 minute, we would like to have only one record for each unique phone number.

This data is sent via Logstash into an S3 bucket.

We do not care which record will be selected (first, last, random) but we need only one.

Input index example:

phone no     time     data  address
50-5325471   1:10:50  A
50-5325471   1:10:51  B
50-5325471   1:10:52  C
55-6789345   1:10:50  A
55-6789345   1:10:53  B
57-3434345   1:10:55  C
50-5325471   1:15:50  D
50-5325471   1:15:55  E
55-6789345   1:15:50  F

Output data example:

phone no     time     data  address
50-5325471   1:10:50  A
55-6789345   1:10:50  A
50-5325471   1:15:55  E
55-6789345   1:15:50  F

The idea is to have a filter that keeps one record per phone number for each Logstash iteration.

Could someone please point me in the right direction?

Thanks


You can use the Fingerprint plugin to calculate a unique value that is then used as the document_id.
Something like this should work if the time is in hh:mm format, without seconds. Within one minute you would get an insert for A, then updates for B and C; because every event in that minute maps to the same document_id, at the end only the last record (data C) remains.

filter {
...
  fingerprint {
    source => [ "phone no", "time" ]
    concatenate_sources => true
    target => "[@metadata][docid]"
    # a hash-based method derives the id from the source fields;
    # method => "UUID" would generate a random id per event and defeat deduplication
    method => "SHA256"
    id => "fingerprint"
  }
...
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "phones"
    document_id => "%{[@metadata][docid]}"
  }
}

Assuming you have parsed the [time] field into [@timestamp], you can add a metadata field for the fingerprint that contains only hours and minutes:

mutate { add_field => { "[@metadata][time]" => "%{{HH:mm}}" } }
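Putting the pieces together, a minimal sketch could look like the following. The field names [time] and [phone no] and the H:mm:ss pattern are assumptions taken from the example data; adjust them to your actual mapping.

```
filter {
  # parse the event time into @timestamp (input like "1:10:50" assumed)
  date {
    match => [ "time", "H:mm:ss" ]
  }
  # keep only hours and minutes, so all events within the same minute
  # produce the same fingerprint
  mutate { add_field => { "[@metadata][time]" => "%{{HH:mm}}" } }
  # hash phone number + truncated time into a deterministic document id
  fingerprint {
    source => [ "phone no", "[@metadata][time]" ]
    concatenate_sources => true
    target => "[@metadata][docid]"
    method => "SHA256"
  }
}
```

With this in place, every event for the same phone number within the same minute overwrites the same document in the phones index, leaving exactly one record per phone number per minute.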

You could also try this with aggregations.

Hi Rios,
Thank you!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.