Logstash as a Statefulset in Kubernetes - File Input and duplicated logs

Hello,
I am trying to deploy a multi-pod Logstash StatefulSet on a Kubernetes cluster using the file input.
It looks like each pod is reading the same logs from the log file placed on a PVC, so we are getting duplicated logs in our Elastic instance.
E.g.: 2 pods running --> 2 copies of each log being posted to Elastic.

Any hints on the configuration to get this solved?

Thanks!

Hello and welcome,

You will need to share your Logstash configuration; you didn't include it.

You cannot have two or more different instances reading from the same log files; this will lead to duplication.

You are right.

  logstash.yml: |
    http.host: "0.0.0.0"
    path.config: /usr/share/logstash/pipeline
    queue.type: persisted
    queue.max_bytes: 1gb

  pipeline.conf: |
    input {
      file {
        path => "/var/app-logs/logs/*"
        exclude => "*.gz"
        mode => "tail"
        start_position => "end"
        codec => "json"
        sincedb_clean_after => "7"
        sincedb_path => "/var/mes-logs/logstash-sincedb"
      }
    }

I am not sure what the issue is.

If you have 2 or more Logstash instances reading the same logs, they will be duplicated.

While you can use a custom document id to avoid duplication on the Elasticsearch side, I would not recommend it in this case.
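For reference, a custom document id is usually built by hashing the event with the fingerprint filter and passing the result to the elasticsearch output; this is only a sketch, and the hosts value is a placeholder, not taken from your setup:

```
filter {
  fingerprint {
    # hash the raw message so identical events get identical ids
    source => "message"
    target => "[@metadata][fingerprint]"
    method => "SHA256"
  }
}

output {
  elasticsearch {
    hosts => ["https://localhost:9200"]
    # duplicate events overwrite each other instead of creating new docs
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```

This turns duplicate inserts into overwrites of the same document, which hides the duplication but still wastes the work of processing each event twice.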

Use just one instance.

Then does that mean Logstash can't be deployed as a StatefulSet in Kubernetes and made highly available with more than one pod running?
The idea of using multiple pods was to improve performance and be able to scale horizontally.
I can't see any info about this in the Logstash Helm chart: helm-charts/logstash at main · elastic/helm-charts · GitHub

I do not use Kubernetes, but to have an HA deployment of Logstash you need third-party tools, and it also depends on your input.

For example, if you are receiving data using a TCP or UDP input, you can put a load balancer in front of Logstash and then run as many Logstash instances as you want. Likewise, if you are consuming data from Kafka, you can also run multiple Logstash instances.
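A Kafka-based setup can be sketched like this (the broker address, topic, and group names are placeholders); every Logstash pod that shares the same `group_id` joins one consumer group, so Kafka assigns each pod a distinct subset of the partitions and no event is read twice:

```
input {
  kafka {
    bootstrap_servers => "kafka:9092"
    topics => ["app-logs"]
    # all pods in this group split the partitions between them
    group_id => "logstash"
    codec => "json"
  }
}
```

Scaling is then just adding pods, up to the number of partitions in the topic.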

But for the file input you need to read the file and track the position already read, so having 2 or more tools doing this adds a lot of unnecessary complexity; that's one of the reasons you should have just one tool reading the files.

Logstash alone has no support for any kind of HA deployment.

As mentioned, it depends on the input, but also, most of the time the performance issues or bottlenecks are not on the Logstash side but on the receiving side, so scaling Logstash horizontally may not help at all and can in some scenarios make things worse.

Also, as mentioned, Logstash has no support for HA on its own; it needs third-party tools, and your data needs to use specific inputs that allow load balancing, for example.

Thanks for your feedback!
I currently have a setup with an HTTP input that works with an ingress handling the load balancing, but I wanted to reduce the HTTP calls inside my cluster with this file input. If I can't scale, though... I need to double-check what works better in my scenario.

Again, thanks for your input @leandrojmp !