I mainly use Logstash with the Kafka input and have never had any issues.
But now I want to read some logs from Elasticsearch and write them to Kafka, so I used the elasticsearch input plugin.
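For context, here is a minimal sketch of the kind of pipeline I mean. All hosts, index and topic names below are placeholders, not my real configuration:

```
input {
  elasticsearch {
    hosts => ["https://elasticsearch.example.org:9200"]  # placeholder host
    index => "my-logs-*"                                 # placeholder index
    query => '{ "query": { "match_all": {} } }'
  }
}
output {
  kafka {
    bootstrap_servers => "kafka.example.org:9092"        # placeholder broker
    topic_id => "my-output-topic"                        # placeholder topic
  }
}
```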
To my surprise, I saw many more logs in my output topic than expected. After investigating the data, I observed 4x more logs in the topic. Then I realized that our production environment has 4 hosts in the Logstash cluster, which would explain the 4x duplication.
This is the only explanation I have. It is not a problem for Kafka to have 4 hosts in the same consumer group, but it is another matter entirely when 4 different hosts run the same query against the same Elasticsearch index. Our ELK engineer is out of the office, so I decided to post my question here. Does anyone know how to deal with this challenge?
I can't share the actual technical configuration because I am an end user of ELK at my company and don't have the rights to access the servers.
To clarify, when I say 4, I mean 4 Logstash nodes, all of them active. I suspect that when I deploy a Logstash pipeline from Kibana, the pipeline runs on all nodes, so the Elasticsearch query in my input is executed 4 times (one query per node).
I never saw this problem with the Kafka input in my previous pipelines: when the 4 nodes run the same pipeline, they use the same consumer group, so each message is read only once. A sketch of what I mean is below.
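For comparison, this is roughly what my Kafka input looks like (broker and topic names are placeholders); the shared group_id is what prevents duplicates:

```
input {
  kafka {
    bootstrap_servers => "kafka.example.org:9092"  # placeholder broker
    topics => ["my-input-topic"]                   # placeholder topic
    group_id => "logstash-pipeline-1"              # same group_id on all 4 nodes,
                                                   # so each message is consumed once
  }
}
```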
If you are deploying Logstash configurations in Kibana, then you are using Centralized Pipeline Management.
With Centralized Pipeline Management, each Logstash instance is configured to fetch its pipeline configurations from Elasticsearch, and you can also configure each instance to run all pipelines or only specific ones.
From what you shared, it seems that your Logstash nodes are configured to run all pipelines, so the behavior you describe is expected.
You have 4 different nodes making the same query to your Elasticsearch cluster, so each of the 4 nodes gets a response to that query, and of course this leads to duplicated data.
What you need is for this pipeline to be executed on just one of the servers, but since you are using Centralized Pipeline Management, this requires changing the Logstash configuration directly on each server and restarting the service.
You would need to specify in each node's logstash.yml file which pipelines it should run. Right now you probably have xpack.management.pipeline.id set to *, which means each instance runs everything; this needs to be changed on every node, as sketched below.
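For example, the logstash.yml on the node that should own the Elasticsearch-to-Kafka pipeline could list that pipeline by ID, while the other nodes list everything except that one. The pipeline IDs here are hypothetical:

```yaml
# node1: run the elastic-to-kafka pipeline (plus any shared pipelines)
xpack.management.pipeline.id: ["elastic-to-kafka", "kafka-pipeline-1"]

# nodes 2-4: list only the shared pipelines, omitting elastic-to-kafka
# xpack.management.pipeline.id: ["kafka-pipeline-1"]
```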
Yes, with Kafka, when using the same group id you do not get duplicated data, but the Elasticsearch input does not work like that, nor does it have any similar feature.
Hello,
You're totally right! xpack.management.pipeline.id is set to *. I will ask our support team to configure the pipeline to run on a single node. I understand that with the xpack.management.pipeline.id setting we can simply specify, per node, the names of the pipelines it should run. The only drawback I see is that the pipeline will no longer be fault tolerant: if the pipeline runs on node1 and node1 fails, it will not be executed on another node. Maybe there is a technical solution to avoid this; I will discuss it with the support team.