I mainly use Logstash with the Kafka input and have never had any issues.
But now I want to read some logs from Elasticsearch and write them to Kafka, so I used the elasticsearch input plugin.
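For context, here is a minimal sketch of the kind of pipeline I mean. All hosts, index and topic names below are placeholders, not my real configuration:

```
input {
  elasticsearch {
    hosts => ["https://elasticsearch.example.org:9200"]  # placeholder host
    index => "my-logs-*"                                 # placeholder index
    query => '{ "query": { "match_all": {} } }'
  }
}
output {
  kafka {
    bootstrap_servers => "kafka.example.org:9092"        # placeholder broker
    topic_id => "my-output-topic"                        # placeholder topic
  }
}
```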
To my surprise, I saw many more logs in my output topic than expected. After investigating the data, I observed 4x more logs in the topic. Then I realized that our production environment has 4 hosts in the Logstash cluster, which would explain the 4x duplication.
This is the only explanation I have. It is not a problem for Kafka to have 4 hosts in the same consumer group, but it is another matter entirely when 4 different hosts run the same query against the same Elasticsearch index. Our ELK engineer is out of the office, so I decided to post my question here. Does anyone know how to deal with this challenge?
I can't share the actual technical configuration because I am an end user of ELK at my company and don't have the rights to access the servers.
To clarify, when I say 4, I mean 4 Logstash nodes, all of them active. I suspect that when I deploy a Logstash pipeline from Kibana, the pipeline runs on all nodes, so the Elasticsearch query in my input is executed 4 times (one query per node).
I never saw this problem with the Kafka input in my previous pipelines: when the 4 nodes run the same pipeline, they use the same consumer group, so each message is read only once. A sketch of what I mean is below.
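For comparison, this is roughly what my Kafka input looks like (broker and topic names are placeholders); the shared group_id is what prevents duplicates:

```
input {
  kafka {
    bootstrap_servers => "kafka.example.org:9092"  # placeholder broker
    topics => ["my-input-topic"]                   # placeholder topic
    group_id => "logstash-pipeline-1"              # same group_id on all 4 nodes,
                                                   # so each message is consumed once
  }
}
```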
If you are deploying Logstash configurations in Kibana, then you are using Centralized Pipeline Management.
With Centralized Pipeline Management, each Logstash instance is configured to fetch its pipeline configurations from Elasticsearch, and you can also configure each instance to run all pipelines or only specific ones.
From what you shared, it seems that your Logstash nodes are configured to run all pipelines, so the behavior you describe is expected.
You have 4 different nodes making the same query to your Elasticsearch cluster, so each of the 4 nodes gets a response to that query, and of course this leads to duplicated data.
What you need is for this pipeline to be executed on just one of the servers, but since you are using Centralized Pipeline Management, this requires changing the Logstash configuration directly on each server and restarting the service.
You would need to specify in each node's logstash.yml file which pipelines it should run. Right now you probably have xpack.management.pipeline.id set to *, which means each instance runs everything; this needs to be changed on every node, as sketched below.
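For example, the logstash.yml on the node that should own the Elasticsearch-to-Kafka pipeline could list that pipeline by ID, while the other nodes list everything except that one. The pipeline IDs here are hypothetical:

```yaml
# node1: run the elastic-to-kafka pipeline (plus any shared pipelines)
xpack.management.pipeline.id: ["elastic-to-kafka", "kafka-pipeline-1"]

# nodes 2-4: list only the shared pipelines, omitting elastic-to-kafka
# xpack.management.pipeline.id: ["kafka-pipeline-1"]
```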
Yes, with Kafka, when using the same group id you do not get duplicated data, but the Elasticsearch input does not work like that, nor does it have any similar feature.
Hello,
You're totally right! xpack.management.pipeline.id is set to *. I will ask our support team to configure the pipeline to run on a single node. I understand that with the xpack.management.pipeline.id setting we can simply specify, per node, the names of the pipelines it should run. The only drawback I see is that the pipeline will no longer be fault tolerant: if the pipeline runs on node1 and node1 fails, it will not be executed on another node. Maybe there is a technical solution to avoid this; I will discuss it with the support team.