How to use document_id to avoid duplicate logs?

Hi,

I'm using the Logstash file input to index various log files available in the input location. These log files are generated by log4j on various systems and are fetched to the input location by a cron job.

  1. Is there any way I can use Filebeat to fetch the logs created by log4j on these systems? As I mentioned above, I'm currently using a batch-script cron job to fetch these from the user systems to the input location.

By default, log4j rolls a log file over to a backup once it reaches a size limit, so Logstash receives duplicate logs from time to time.

  2. Is there any way to use an ID to avoid indexing duplicate data? Currently I'm using a custom document_id built from a combination of the @timestamp and ID fields (see my output configuration below). But this seems to be overwriting the indexed data (correct me if I'm wrong here). Instead, I would like to skip indexing an event entirely if it is a duplicate.

My output configuration:

output {
   elasticsearch {
      hosts => ["http://localhost:9200"]
      index => "test"
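      # the same @timestamp + ID combination always maps to the same document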
      document_id => "%{@timestamp}_%{ID}"
   }
}

Any help here is appreciated. Thanks in advance.

Filebeat has a Logstash output, so you can install it on those systems and point it at your Logstash instance instead of copying files with cron.
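The Logstash side would just swap the file input for a beats input. A minimal sketch, assuming Filebeat ships to port 5044:

input {
   beats {
      # listen for events shipped by Filebeat agents
      port => 5044
   }
}

On each user system, Filebeat's output.logstash section would then point at this host and port.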

The doc_as_upsert option allows creation of a new document if the document_id does not exist in ES; ES will overwrite (update) the document if the same document_id already exists.
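A minimal sketch of how that might look in your output, reusing the index and document_id from above (doc_as_upsert is used together with action => "update"):

output {
   elasticsearch {
      hosts => ["http://localhost:9200"]
      index => "test"
      document_id => "%{@timestamp}_%{ID}"
      # update the existing document, or insert it if the id is not there yet
      action => "update"
      doc_as_upsert => true
   }
}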

@ptamba thanks for the reply.
For the second question, I don't want ES to update the document. Instead, I want Logstash to ignore the duplicate. Is there any way I can do that?

I haven't tried this, but specifying action create appears to avoid overwriting an existing document:

" * create: indexes a document, fails if a document by that id already exists in the index. "

action => "create"
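Untested, but dropped into your output from earlier in the thread it would look something like this; a duplicate id should then fail with a version conflict instead of overwriting:

output {
   elasticsearch {
      hosts => ["http://localhost:9200"]
      index => "test"
      document_id => "%{@timestamp}_%{ID}"
      # fail the write (and skip the event) if this id already exists
      action => "create"
   }
}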

Yes, the action option above does keep duplicates from being indexed. But since this happens in the output plugin, I can see the duplicates passing through all my filter plugin operations, which is redundant.
Is there any way to identify a duplicate before the filter stage and avoid processing it?

Not that I'm aware of.

@ptamba Yes, I can use the Filebeat agent and point its output at my Logstash instance.
But in my case there are hundreds of user systems. Do I need to manually install Filebeat on each one, or is there a simpler way to do this?
Thank you.

If you want to use Filebeat to collect logs on those systems, then yes, you have to install it on every system.
