Question on "Multiple Connections for Logstash High Availability" diagram published in logstash documentation

Hi Logstash Team

Can you explain whether duplicate events or log messages will be processed if we implement the Logstash stack as per the HA architecture shown in the second diagram of the "Multiple Connections for Logstash High Availability" section? When multiple Logstash instances point at the same input sources and types, how do we prevent duplicate messages from being published to the output message queues?

URL reference:
https://www.elastic.co/guide/en/logstash/current/deploying-and-scaling.html


@RajkumarV

If duplicate messages are read, then they will be processed in duplicate. To prevent that, messages are "popped" off the queue by individual Logstash instances, so that only one instance processes each message.
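
For illustration, here is a minimal sketch of that pop-based pattern using the redis input with list semantics (the broker host and key name below are placeholders, not anything from the docs). Every indexing instance runs this same config, and because each event is popped off the shared list exactly once, only one instance ever sees it:

```
input {
  redis {
    host      => "broker.example.com"   # placeholder broker address
    data_type => "list"                 # list semantics: each event is popped exactly once
    key       => "logstash-events"      # placeholder name of the shared list
  }
}
```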

Alternatively, you could have a queue dedicated to each Logstash instance. This might be easier to set up, but there is a greater chance of data loss.

Also note that in the diagram you speak of, each shipping instance actually points to different sources, which solves the problem for sure!

Does this answer your question?

Hi Phaedrus

Thanks for getting back, but no, that does not quite answer it.

I am talking about the diagram where each Logstash shipping instance (three shipping instances, each processing all four source types, and from the same sources) handles every type of message source: the UDP, File, RSS, and Twitter input plugins. Can you explain the intended idea behind this stack implementation and how to avoid duplicate message processing in such an architecture?

As I understand it, to provide HA the diagram shows multiple shipping instances reading from the same source. All messages logged from that source would therefore be read by every shipping instance and sent to its own messaging queue, and from there you can imagine how duplicate messages would flow through the rest of the pipeline.

Please clarify if my understanding is incorrect.

@RajkumarV

I understand what you are saying, and I discussed this with the Logstash team. The intended idea of this architecture is that duplication is avoided by the nature of each input.

For example, the UDP input will never duplicate, because messages are probably being load balanced across instances, so there is true HA there. But for something like Twitter or RSS, you would configure each Logstash instance to read from a separate feed, and in that case the architecture provides scalability, not necessarily HA.
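
As a rough sketch of that separate-feed idea for the twitter input (all credentials and keywords below are placeholders), each instance would track a disjoint set of keywords, so the instances scale out without ever reading the same feed twice:

```
# Shipping instance 1 -- instance 2 would use a disjoint set,
# e.g. keywords => ["logstash"], so the feeds never overlap
input {
  twitter {
    consumer_key       => "KEY"             # placeholder OAuth credentials
    consumer_secret    => "SECRET"
    oauth_token        => "TOKEN"
    oauth_token_secret => "TOKEN_SECRET"
    keywords           => ["elasticsearch"]
  }
}
```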

I hope this helps answer your question.

Thanks Phaedrus, I really appreciate your taking the time to respond. Yes, I agree with the scalability aspect that this architecture can meet, but I am still not sure about the HA aspect. Can you please explain more about the statement:

"UDP input will never duplicate, because messages are probably being load balanced across instances - so there is true HA there"

Does this mean that if all the Logstash shipping instances are configured with a UDP input on the same hostname and port, the plugin/protocol automatically takes care of diverting traffic among the shipping instances, ensuring that no message is passed to or picked up by more than one of them?

Thanks again!!!

@RajkumarV

There are a few ways you can design this, but generally you want an intermediary device such as a load balancer to distribute traffic across your Logstash instances. You could use any commercial hardware load balancer, or something like HAProxy. Send your UDP traffic to the virtual IP on the load balancer, which then distributes it evenly between the Logstash instances. Logstash has no awareness of the load balancing; it simply listens for UDP messages.
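
For reference, the Logstash side of that setup is nothing more than a plain UDP listener; every instance behind the VIP runs the same config (the port and type below are arbitrary choices, not from the docs):

```
input {
  udp {
    port => 5514        # arbitrary listening port; the load balancer's VIP forwards here
    type => "syslog"    # optional tag so downstream filters can match these events
  }
}
```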

The true HA comes from the fact that most load balancers will periodically check the status of pool members, and take them out of service if they fail.

Does this help?

Also note that UDP is a "send only once" protocol with no error checking, handshakes, or acknowledgements, so your sending application will not attempt to resend traffic to another address if the receiving application does not receive it.

To the best of my knowledge, HAProxy does not support UDP at this time.

@voipoclay,

F5 has a UDP health check that fails when it receives ICMP_UNREACHABLE from the pool member, which is an interesting approach; however, it also sends an arbitrary string that must be filtered out in Logstash as a side effect.
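
If you take that approach, the probe string can be discarded with a conditional and the drop filter; the match string below is a placeholder for whatever your health monitor actually sends:

```
filter {
  if [message] == "f5-health-probe" {   # placeholder: match your monitor's probe string
    drop { }                            # discard health-check noise before it reaches outputs
  }
}
```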

Another approach could be to use the Logstash heartbeat plugin, then have your load balancer check Elasticsearch every so often. I am currently working on a blog post that provides more detail and configuration examples on this topic.
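
As a hedged sketch of that heartbeat idea (the Elasticsearch address and index name are assumptions), each Logstash instance emits a periodic event, and the load balancer's monitor can then query Elasticsearch for recent heartbeats from each instance:

```
input {
  heartbeat {
    interval => 30                      # emit one heartbeat event every 30 seconds
    message  => "ok"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]         # placeholder Elasticsearch address
    index => "logstash-heartbeat"       # assumed index the health monitor would query
  }
}
```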

UDP packets can also arrive out of sequence, which can affect multiline log processing.

TCP is the preferred approach, if it's an option.

Yes, a detailed blog post on this topic would be much appreciated. Looking forward to getting a clearer picture in this regard.

Please share the corresponding link once the blog post is ready.

Cheers!!!

In my applications I typically either write to disk or use syslog on UDP 514, and then set up logstash-forwarder on my servers; logstash-forwarder is fast, reliable, TCP-based, etc.

Logstash-forwarder is also very tolerant of network faults: it will keep trying to transmit logs until the Logstash server acknowledges receipt.
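
On the receiving end, logstash-forwarder speaks the lumberjack protocol over TLS, so the Logstash server needs something along these lines (the port and certificate paths are placeholders):

```
input {
  lumberjack {
    port            => 5043                                  # arbitrary listening port
    ssl_certificate => "/etc/pki/tls/certs/logstash.crt"     # placeholder cert/key paths; the
    ssl_key         => "/etc/pki/tls/private/logstash.key"   # forwarder must trust this cert
  }
}
```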

I also think using Redis as a relay works really well.
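
To round out that picture, a shipper would push events onto the same list that the indexing instances pop from, as in the earlier sketch; the host and key here are the same placeholders:

```
output {
  redis {
    host      => "broker.example.com"   # same placeholder broker as the input sketch above
    data_type => "list"                 # push onto the list the indexers pop from
    key       => "logstash-events"
  }
}
```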