Logstash pipeline high availability

Hello everyone,

How can I achieve HA for Logstash pipelines?
I have 3 nodes; if a node that is running a number of pipelines goes down, what happens?
What are the solutions to achieve HA?

Logstash does not have any built-in features to enable HA, and each Logstash instance is independent from the others.

To have some kind of HA with Logstash you will need to use a third-party tool like HAProxy to load balance the requests between your nodes.
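As a rough illustration, a minimal HAProxy sketch could look like the following. All hostnames and ports here are assumptions, and note that open-source HAProxy has traditionally load balanced TCP/HTTP rather than UDP, so this assumes a TCP input (e.g. Beats on port 5044):

```
# Hypothetical HAProxy sketch: distribute TCP traffic across three
# Logstash nodes, with health checks so a down node stops receiving.
frontend logstash_in
    bind *:5044
    mode tcp
    default_backend logstash_nodes

backend logstash_nodes
    mode tcp
    balance roundrobin
    server ls1 logstash-1:5044 check
    server ls2 logstash-2:5044 check
    server ls3 logstash-3:5044 check
```

The `check` option makes HAProxy probe each backend and remove it from rotation while it is unreachable.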

So if the pipeline goes down, there is no solution except a third-party load balancer?

As I said, Logstash has no built-in HA; to have some kind of HA you need third-party tools like a load balancer.

If a Logstash pipeline goes down for some reason, you will need to find the cause and fix it. Depending on how you are collecting data, having multiple nodes behind a load balancer can help you, but you still need to find the cause and fix it.

I understand you, but in my case the Logstash pipeline receives a large amount of data per second on a UDP port. If the pipeline goes down, I will lose a large amount of data. Isn't that unreasonable?

You will need to configure your sources to send to the load balancer, which will then forward to your Logstash nodes, but you may still have some data loss.

This is expected, even more so when using UDP.

So I want to handle this case.
I have put forward two solutions to overcome it:

  • Run the same pipeline on each node: I will face data duplication, because the UDP protocol does not send acknowledgments.
  • Run a cron job script that starts the pipeline when the other pipeline goes down: I will lose data during the time it takes to start the pipeline on the other node.

Do you have any idea how to solve this?

Not sure how this would fix the issue. What is sending data over UDP? Is it a network device? Depending on the source, you may not be able to send data to multiple destinations.

This will not solve your issue either: the minimum interval for a cron job is 1 minute by default, and when you add a new pipeline to Logstash it takes some time to start up, so you may end up losing data for a couple of minutes in this case.

A load balancer like HAProxy replaces the two things you proposed above: you can configure it to send to multiple Logstash instances without duplicating data, and it checks whether an instance is down and stops sending data to it much faster than a cron job would.

Depending on what is sending the data, you may put Kafka in between to decouple the shipping of the data from its consumption.

This is what I use: I have a load balancer that sends the data to two Logstash instances whose only function is to write the data to Kafka topics, and then I have other Logstash instances consuming the data from those Kafka topics.
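A minimal sketch of that two-tier layout could look like this, assuming hypothetical topic names, ports, and broker addresses (none of these come from the thread):

```
# Shipper pipeline (the instances behind the load balancer):
# receive UDP and only forward to a Kafka topic.
input {
  udp {
    port => 5144
  }
}
output {
  kafka {
    bootstrap_servers => "kafka-1:9092,kafka-2:9092"
    topic_id => "udp-raw"
  }
}
```

```
# Consumer pipeline (separate Logstash instances): read from the same
# topic. Consumers sharing one group_id split the partitions between
# them, so events are processed once rather than duplicated.
input {
  kafka {
    bootstrap_servers => "kafka-1:9092,kafka-2:9092"
    topics => ["udp-raw"]
    group_id => "logstash-consumers"
  }
}
output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
  }
}
```

Kafka buffers the data, so if a consumer instance goes down the events wait in the topic instead of being dropped at the UDP socket.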

Having no data loss in this use case is really not an easy problem to solve; you may end up with a lot of pieces in your infrastructure and spend a lot of money, and still get some data loss.

You mean running the same Logstash pipeline on more than one node, writing to the same index, and then using a load balancer to send data to the first node, so that once it goes down the load balancer moves the traffic to the second node?

It depends on what you want; you can configure a load balancer to work this way, with one active server and backup servers that only receive data in case of failure.

But you can also configure it to distribute the request between the servers.
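For the active/backup variant, a hypothetical HAProxy backend sketch (hostnames and ports are assumptions) might look like:

```
# Only ls1 receives traffic; ls2 and ls3 are marked "backup" and take
# over only when the active server fails its health checks.
backend logstash_failover
    mode tcp
    server ls1 logstash-1:5044 check
    server ls2 logstash-2:5044 check backup
    server ls3 logstash-3:5044 check backup
```

Dropping the `backup` keywords turns the same backend into the distributing setup instead.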


Doesn't this approach cause duplicated data?

No, this is how load balancers work: you configure them to distribute requests between the servers using some algorithm like round robin or least connections.

With round robin, for example, a request goes to server 1, the next request goes to server 2, the following one goes to server 3; if you have only 3 servers, the next request goes back to server 1, and so on.

If clients are using keepalive connections then the load balancer is actually distributing connections, not requests. There are cases where that is important, such as where a server goes down, and all of the clients establish new connections before it comes back up. It will then get zero requests until one of the persistent connections times out.

As mentioned, you could use a Kafka cluster and use the same configs on all 3 Logstash instances, pulling from the same topics. That achieves some degree of high availability and decouples shipping and distribution of the data, unless you want to go for the load balancer option. I prefer Kafka.