Logstash: filter unique key documents

Hello,

We have an index which has multiple documents with the same phone number.

Each document will always contain the phone number and may contain additional information (see example below)

The documents are written into the index constantly.

For a certain time window, for example 1 minute, we would like to have only one record for each unique phone number.

This data is sent via Logstash into an S3 bucket.

We do not care which record will be selected (first, last, random) but we need only one.

Input index example:

phone no     time     data  address
50-5325471   1:10:50  A
50-5325471   1:10:51  B
50-5325471   1:10:52  C
55-6789345   1:10:50  A
55-6789345   1:10:53  B
57-3434345   1:10:55  C
50-5325471   1:15:50  D
50-5325471   1:15:55  E
55-6789345   1:15:50  F

Output data example:

phone no     time     data  address
50-5325471   1:10:50  A
55-6789345   1:10:50  A
50-5325471   1:15:55  E
55-6789345   1:15:50  F

The idea is to have a filter that keeps one record per phone number for each Logstash iteration.

Could someone please point me in the right direction?

Thanks


You can use the Fingerprint plugin to calculate a unique value that is then used as the document_id.
Something like this should work if the time is in hh:mm format, without seconds. Within one minute you would get an insert for A, then updates for B and C; because every event in that minute maps to the same document_id, at the end only the last record (data C) remains.

filter {
...
  fingerprint {
    source => [ "phone no", "time" ]
    concatenate_sources => true
    target => "[@metadata][docid]"
    # a hash-based method derives the id from the source fields;
    # method => "UUID" would generate a random id per event and defeat deduplication
    method => "SHA256"
    id => "fingerprint"
  }
...
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "phones"
    document_id => "%{[@metadata][docid]}"
  }
}

Assuming you have parsed the [time] field into [@timestamp], you can add a metadata field for the fingerprint that contains only hours and minutes:

mutate { add_field => { "[@metadata][time]" => "%{{HH:mm}}" } }
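Putting the pieces together, a minimal sketch could look like the following. The field names [time] and [phone no] and the H:mm:ss pattern are assumptions taken from the example data; adjust them to your actual mapping.

```
filter {
  # parse the event time into @timestamp (input like "1:10:50" assumed)
  date {
    match => [ "time", "H:mm:ss" ]
  }
  # keep only hours and minutes, so all events within the same minute
  # produce the same fingerprint
  mutate { add_field => { "[@metadata][time]" => "%{{HH:mm}}" } }
  # hash phone number + truncated time into a deterministic document id
  fingerprint {
    source => [ "phone no", "[@metadata][time]" ]
    concatenate_sources => true
    target => "[@metadata][docid]"
    method => "SHA256"
  }
}
```

With this in place, every event for the same phone number within the same minute overwrites the same document in the phones index, leaving exactly one record per phone number per minute.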

You could also try this with aggregations.

Hi Rios,
Thank you!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.