Need to load balance Logstash

I'm currently running a single Logstash instance and am interested in running several, both for resilience in case one goes down and to improve performance. I'm currently ingesting data with the http input plugin.

I'm thinking about using the Kinesis input plugin to accomplish this. From my understanding (from reading past posts), if I set the application_name in the Logstash input to the same value on two separate machines, that will automatically act as a load balancer. Am I correct?
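For reference, something like this on both machines is what I have in mind (the stream name and region are just placeholders):

```
input {
  kinesis {
    kinesis_stream_name => "my-log-stream"       # placeholder stream name
    application_name    => "logstash-consumers"  # same value on both machines
    region              => "us-east-1"           # placeholder region
    codec               => json
  }
}
```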

Will it both distribute the load between the two Logstash instances and also handle the case where one of them goes down?


A common way to do this is using a message queue like Kafka.

You would send your data to Kafka and configure your Logstash nodes to read from Kafka using the same group id. That way, if one Logstash node goes down, the others keep consuming.
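As a rough sketch (broker and topic names are just examples), every Logstash node would get the same input configuration, in particular the same group_id:

```
input {
  kafka {
    bootstrap_servers => "kafka01:9092"   # example broker address
    topics            => ["logs"]         # example topic
    group_id          => "logstash"       # identical on every Logstash node
    codec             => json
  }
}
```

Kafka then assigns the topic's partitions across the members of that consumer group, and rebalances them if a member disappears.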

But this depends on how you ingest data and whether you can change that.

How about doing it with Kinesis? For our infrastructure, Kinesis makes the most sense. Does it work in the same way?

Never used Kinesis, so I'm not sure if it will do what you want.

In the case of the kafka input in Logstash, you have an option (group_id) that tells the Kafka brokers that all the consumers are part of the same consumer group, so if one of the nodes goes down the other ones will get the messages and you will not have duplicated messages.

If the application_name option in Kinesis works the same way, then it will load balance the messages between your nodes without duplicating, but you will need to test this.

This is only true if everything supports and is configured for "exactly-once" semantics. Otherwise Kafka provides only "at least once" guarantees. Under normal circumstances you will not have duplicates, but it is possible.

And how can I reduce the likelihood of duplicates or what can I do about it?

To avoid having duplicates you would need to use a self-generated unique id for your documents.

If your original documents already have a unique id, then you can use it as the document _id field in Elasticsearch.

If they do not have a unique id, then you can create one by combining some fields using the fingerprint filter.

In both cases you would need to use the document_id option in your elasticsearch output in Logstash.
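As a rough sketch (the source fields are just examples, pick fields that uniquely identify an event), the fingerprint filter and the elasticsearch output would look something like this:

```
filter {
  fingerprint {
    source              => ["host", "@timestamp", "message"]   # example fields
    concatenate_sources => true
    method              => "SHA256"
    target              => "[@metadata][generated_id]"
  }
}

output {
  elasticsearch {
    hosts       => ["https://es01:9200"]            # example host
    index       => "logs-%{+YYYY.MM.dd}"
    document_id => "%{[@metadata][generated_id]}"   # re-processing the same event overwrites instead of duplicating
  }
}
```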

You can read more in this blog post by Elastic.

Anyway, the likelihood of having duplicates when using Kafka and Logstash is pretty low. I've been running many pipelines with this configuration and never faced a situation that caused a duplicated message.

But as rcowart said, it can happen.

In my case, in all those years I never had the need to tackle what has so far been a nonexistent problem for me.

Awesome, thanks for the response. For my use case, 95% accuracy is what's needed, so a few duplicates won't make or break us. Thanks for the help!

The use of self-generated IDs is the option I would also recommend. However, I would point you to the UUID filter as an alternative. This plugin may need to be installed separately after installing Logstash.
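If it is not bundled with your Logstash distribution, bin/logstash-plugin install logstash-filter-uuid adds it, and the usage is roughly (the field name is just an example):

```
filter {
  uuid {
    target    => "event_uuid"   # example field name; stamp each event once, as early as possible
    overwrite => true
  }
}
```

Putting the id in a regular field rather than in @metadata means it is serialized along with the event, so it survives the trip through Kafka and a later tier can use it as the Elasticsearch document_id.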

A common architecture combining Logstash and Kafka is:

collect (apply UUID here) --> Kafka --> processing --> Kafka --> outputs

Each of these tiers can be scaled independently for performance or redundancy.

While self-generated IDs carry an indexing efficiency penalty, the benefit of using them with Kafka is that you can at any point "replay" the data through updated pipelines and easily replace the existing documents in Elasticsearch.
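As a rough sketch of the collect tier (port, broker, and topic are placeholders), the id is stamped on ingest and the event is handed to Kafka; the downstream indexing tier then reads from Kafka and sets document_id => "%{event_uuid}" in its elasticsearch output, as described above:

```
# collect tier: receive events, stamp them with an id, hand them off to Kafka
input {
  http {
    port => 8080              # example: the existing http ingestion endpoint
  }
}

filter {
  uuid {
    target    => "event_uuid"
    overwrite => true
  }
}

output {
  kafka {
    bootstrap_servers => "kafka01:9092"   # placeholder broker
    topic_id          => "raw-logs"       # placeholder topic
    codec             => json
  }
}
```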
