Looking to see how others might have approached this topic in the past.
For example's sake, I have 2 Logstash nodes running in a hot/cold standby mode that run the exact same (centrally managed) pipelines, which pull data from JDBC and Kafka queues.
The standby is just left in a stopped state and needs manual intervention to flip from the online node to the offline one in the event the online node has issues.
It works, but relies on us getting an alert that the primary node is down to invoke a manual flip.
I want to find a way to have both nodes started but only one running the pipelines, so we don't ingest duplicate data.
I've managed to do something convoluted in the filter section using Ruby to drop records under certain circumstances, but this obviously runs for each record being processed, whereas ideally I'd be able to run something at the input stage to dynamically enable/disable that pipeline on the secondary node.
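Roughly this pattern, although the real condition for deciding which node is active is more involved and the flag file here is just a placeholder:

```
filter {
  ruby {
    # drop the event unless a local flag file marks this node as the active one
    code => 'event.cancel unless File.exist?("/etc/logstash/ACTIVE")'
  }
}
```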
I've looked at the fingerprint plugin, but again this runs on a record-by-record basis and puts additional load on both Logstash and Elasticsearch, so I'm trying to avoid it if possible.
How else are folks running Logstash in an HA configuration when it comes to pulling data from sources like JDBC/Kafka etc.?
I don't think there is an easy way to do what you want, as it doesn't seem that Logstash was designed to work this way, with active and stand-by nodes.
Normally you use Logstash in HA when you are shipping data to Logstash, not pulling data with it, as pulling data adds some complexity regarding duplication and missing data.
For example, if you run two instances with a kafka input you have to choose a group_id. If you choose the same group_id they will work in parallel; if you choose different group_ids the data will be consumed twice and could be duplicated. A similar issue would happen with jdbc inputs, as both instances would need to know what the last data consumed was.
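To make that concrete, here is a minimal sketch of the kafka input (broker addresses and the topic name are just placeholders); with the same group_id on both nodes, Kafka splits the partitions between them instead of delivering every message to both:

```
input {
  kafka {
    bootstrap_servers => "kafka1:9092,kafka2:9092"
    topics            => ["app-events"]
    group_id          => "logstash"   # same value on both nodes => parallel consumption, no duplicates
  }
}
```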
The inputs do not have any dynamic configuration to enable or disable them, so you would need to do that using filters, but as you said, that implies consuming the data twice.
I can think of two ways.
One would be to have the stand-by Logstash node running with the consumer pipelines for kafka and jdbc commented out in the pipelines.yml file; of course you would need a dummy pipeline active to keep the service running. Then you can automate the flip with some scripts.
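For illustration, the stand-by node's pipelines.yml could look something like this (pipeline ids and paths are just examples; the #STANDBY prefix is only there so a script can later uncomment exactly these lines):

```
#STANDBY - pipeline.id: kafka-consumer
#STANDBY   path.config: "/etc/logstash/conf.d/kafka.conf"
#STANDBY - pipeline.id: jdbc-consumer
#STANDBY   path.config: "/etc/logstash/conf.d/jdbc.conf"

# dummy pipeline that keeps the Logstash service running
- pipeline.id: keepalive
  path.config: "/etc/logstash/conf.d/dummy.conf"
```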
If the custom script detects that the active Logstash is not working, it would uncomment the pipelines in pipelines.yml and Logstash would automatically reload them.
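A rough sketch of such a script in Python; the hostname, file path and comment marker are assumptions, and it relies on config.reload.automatic being enabled so Logstash picks up the change to pipelines.yml on its own:

```
#!/usr/bin/env python3
# Hypothetical failover check: if the active node's monitoring API (port 9600)
# stops answering, strip the marker that comments out the consumer pipelines
# in the stand-by node's pipelines.yml.
import urllib.request
import urllib.error

ACTIVE_API = "http://logstash-active:9600/"    # assumed hostname of the active node
PIPELINES_YML = "/etc/logstash/pipelines.yml"  # default path for the Linux packages
MARKER = "#STANDBY "                           # assumed prefix used to comment out the consumer pipelines

def active_is_up() -> bool:
    try:
        with urllib.request.urlopen(ACTIVE_API, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def enable_consumer_pipelines() -> None:
    with open(PIPELINES_YML) as f:
        lines = f.readlines()
    # Remove the marker so the consumer pipelines become active again.
    with open(PIPELINES_YML, "w") as f:
        for line in lines:
            f.write(line[len(MARKER):] if line.startswith(MARKER) else line)

if __name__ == "__main__":
    if not active_is_up():
        enable_consumer_pipelines()
```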
The second way adds complexity to your setup, as you would need another Logstash node and a load balancer.
This Logstash node would be the one consuming from the sources, but without doing any filtering; it would then send the data to a load balancer in front of the two Logstash nodes that process the data, one active and the other as a backup. Of course, this adds more complexity and a single point of failure.
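The consumer pipeline on that extra Logstash node could be as simple as this sketch (hostnames, topic and port are placeholders): it pulls from the source and just forwards the raw events to the load balancer sitting in front of the two processing nodes:

```
input {
  kafka {
    bootstrap_servers => "kafka1:9092,kafka2:9092"
    topics            => ["app-events"]
    group_id          => "logstash-consumer"
  }
}
output {
  # no filtering here, just hand the events to the load balancer
  http {
    url         => "http://logstash-lb:8080"
    http_method => "post"
    format      => "json"
  }
}
```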