How to use document_id to avoid duplicate logs?

Hi,

I'm using the Logstash file input to index various log files available in the input location. These log files are generated by log4j on various systems and are fetched to the input location by a cron job.

  1. Is there any way I can use Filebeat to fetch the logs created by log4j on these systems? As I mentioned above, I'm currently using a batch-script cron job to fetch these from the user systems to the input location.

By default, log4j rolls a log file over to a backup once it reaches a size limit, so Logstash receives duplicate logs from time to time.

  2. Is there any way to use an ID to avoid indexing duplicate data? Currently I'm using a custom document_id built from a combination of the @timestamp and ID fields (see my output configuration below). But this seems to be overwriting the indexed data (correct me if I'm wrong here). Instead, I would like to skip indexing an event entirely if it is a duplicate.

My output configuration:

output {
   elasticsearch {
      hosts => ["http://localhost:9200"]
      index => "test"
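      # the same @timestamp + ID combination always maps to the same document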
      document_id => "%{@timestamp}_%{ID}"
   }
}

Any help here is appreciated. Thanks in advance.

Filebeat has a Logstash output, so you can install it on those systems and point it at your Logstash instance instead of copying files with cron.
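The Logstash side would just swap the file input for a beats input. A minimal sketch, assuming Filebeat ships to port 5044:

input {
   beats {
      # listen for events shipped by Filebeat agents
      port => 5044
   }
}

On each user system, Filebeat's output.logstash section would then point at this host and port.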

The doc_as_upsert option allows creation of a new document if the document_id does not exist in ES; ES will overwrite (update) the document if the same document_id already exists.
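A minimal sketch of how that might look in your output, reusing the index and document_id from above (doc_as_upsert is used together with action => "update"):

output {
   elasticsearch {
      hosts => ["http://localhost:9200"]
      index => "test"
      document_id => "%{@timestamp}_%{ID}"
      # update the existing document, or insert it if the id is not there yet
      action => "update"
      doc_as_upsert => true
   }
}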

@ptamba thanks for the reply.
For the second question, I don't want ES to update the document. Instead, I want Logstash to ignore the duplicate. Is there any way I can do that?

I haven't tried this, but specifying action create appears to avoid overwriting an existing document:

" * create: indexes a document, fails if a document by that id already exists in the index. "

action => "create"
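Untested, but dropped into your output from earlier in the thread it would look something like this; a duplicate id should then fail with a version conflict instead of overwriting:

output {
   elasticsearch {
      hosts => ["http://localhost:9200"]
      index => "test"
      document_id => "%{@timestamp}_%{ID}"
      # fail the write (and skip the event) if this id already exists
      action => "create"
   }
}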

Yes, the action option above does keep duplicates from being indexed. But since this happens in the output plugin, I can see the duplicates passing through all my filter plugin operations, which is redundant.
Is there any way to identify a duplicate before the filter stage and avoid processing it?

Not that I'm aware of.

@ptamba Yes, I can use the Filebeat agent and point its output at my Logstash instance.
But in my case there are hundreds of user systems. Do I need to manually install Filebeat on each one, or is there a simpler way to do this?
Thank you.

If you want to use Filebeat to collect logs on those systems, then yes, you have to install it on every system.
