I'm currently running a single Logstash instance and am interested in running several, both for resilience in case one goes down and for better performance. I'm currently ingesting data with the http input plugin.
I'm thinking about using the Kinesis input plugin to accomplish this. From my understanding (from reading past posts), if I set the application_name in the Logstash input to the same value on two separate machines, that will automatically work as a load balancer. Am I correct?
Will it both distribute the load between the two Logstash instances and also handle the case where one of them goes down?
A common way to do this is using a message queue like Kafka.
You would send your data to Kafka and configure your Logstash nodes to read from it using the same group id. That way, if one Logstash node goes down, the others keep consuming.
But this depends on how you ingest data and whether you can change that.
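As a rough sketch of that setup, the same kafka input could be placed on every Logstash node; the broker addresses, topic, and group name below are assumptions for illustration:

```
# Identical input block on each Logstash node.
# Kafka assigns partitions across all consumers sharing the same group_id,
# so the nodes split the load and take over for each other on failure.
input {
  kafka {
    bootstrap_servers => "kafka1:9092,kafka2:9092"  # assumed broker list
    topics            => ["logs"]                   # assumed topic name
    group_id          => "logstash-indexers"        # same value on all nodes
    codec             => "json"
  }
}
```

With this in place, adding capacity is just starting another Logstash node with the same config (up to the number of partitions on the topic).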
Never used Kinesis, so I'm not sure if it will do what you want.
In the case of the kafka input in Logstash, there is an option that tells the Kafka brokers that all the consumers are part of the same consumer group, so if one of the nodes goes down the other ones will get the messages and you will not have duplicated messages.
If the application_name option in Kinesis works the same way, then it will load balance the messages between your nodes without duplicating them, but you will need to test this.
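For reference, a Kinesis-based equivalent might look like the sketch below, using the logstash-input-kinesis plugin; the stream name, application name, and region are placeholders, and the load-balancing behaviour across nodes is exactly the part you would need to verify:

```
# Same block on both machines. The plugin uses application_name as the
# name of its DynamoDB checkpoint table, so consumers sharing it should
# coordinate shard leases the way a Kafka consumer group shares partitions.
input {
  kinesis {
    kinesis_stream_name => "my-log-stream"       # assumed stream name
    application_name    => "logstash-consumers"  # same value on both nodes
    region              => "us-east-1"           # assumed region
  }
}
```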
This is only true if everything supports and is configured for "exactly-once" semantics. Otherwise Kafka provides only "at least once" guarantees. Under normal circumstances you will not have duplicates, but it is possible.
Anyway, the likelihood of having duplicates when using Kafka and Logstash is pretty low. I've been running many pipelines with this configuration and have never hit a situation that caused a duplicated message.
But as rcowart said, it can happen.
In my case, in all those years I never had the need to tackle a problem that, for me, didn't exist.
The use of self-generated IDs is the option I would also recommend. However, I would point you to the UUID filter as an option; it may need to be installed as a plugin after installing Logstash.
A common architecture combining Logstash and Kafka is: Logstash shippers → Kafka → Logstash indexers → Elasticsearch.
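A minimal sketch of the UUID filter approach, assuming the standard elasticsearch output (hosts and index name here are placeholders): the filter stamps each event with an ID, and the output uses it as the document _id so that reprocessing the same event overwrites rather than duplicates.

```
filter {
  uuid {
    target    => "[@metadata][uuid]"  # store the ID in metadata, not the document body
    overwrite => false                # keep an existing ID if one is already set
  }
}

output {
  elasticsearch {
    hosts       => ["http://localhost:9200"]      # assumed cluster address
    index       => "logs-%{+YYYY.MM.dd}"          # assumed index pattern
    document_id => "%{[@metadata][uuid]}"         # self-generated ID as _id
  }
}
```

Note that for the replay scenario to deduplicate correctly, the ID has to be attached before the event enters Kafka (i.e. on the shipper tier), or be derived deterministically from the event content (the fingerprint filter); a fresh UUID generated on the indexer after a replay would produce a new _id each time.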
Each of these tiers can be scaled independently for performance or redundancy.
While there is an indexing efficiency penalty with self-generated IDs, the benefit when using them with Kafka is that you can at any point "replay" the data through updated pipelines and easily replace the existing documents in Elasticsearch.