I copied this picture from "https://www.elastic.co/guide/en/logstash/current/deploying-and-scaling.html", and I can't understand how this architecture keeps the message queue from receiving duplicate data.

Can you please not put the question in the topic.

What do you want to know?

In this picture, three Logstash shipping instances get data from two data sources and then put that data into the message queue. So for a single message from a data source, all three shipping instances will receive it and put it on the queue, and there will be three copies of the same message in the message queue. How is this avoided?

What I think is meant is that the shippers can have multiple input plugins of various kinds and that everything they receive will be sent to the queue. The idea is not that all shippers read from exactly the same sources since, as you've figured out yourself, that would lead to message duplication.

Then what is the difference between these two pictures? According to the documentation, the second one is better.


In the top image you have four Logstash instances. Each instance takes data from only one source. For example, UDP only goes to the top-left Logstash instance. It's effectively a one-to-one relationship between sources and Logstash shipping instances.

This creates a few problems.

  1. If one Logstash instance were to die, then that data source completely fails. There is nothing in place that will auto-route it to a working Logstash instance, and the other Logstash machines only have one input plugin configured. So you couldn't just re-point the data source to a new IP; you would also have to configure that Logstash instance to accept the new input type.
  2. Load may not be equal for all inputs. What if your file input plugin gets 10,000 events per second, but your RSS plugin gets 10 events per second? The first Logstash machine will be doing a whole lot of work, while the second one will be sitting around barely doing anything.

In the second image things work differently. Your data sources have three different Logstash instances to send data to. Now all three instances can accept data from all sources, and the sources can send data to any of the three. The lines going from each data source to all three Logstash instances don't mean that it sends the same data three separate times. They mean that it will round-robin through the servers one at a time. So Event 4567 gets sent to Logstash1, Event 4568 gets sent to Logstash2, and so on.
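To make the "one event goes to one instance" point concrete, here is a minimal Python sketch of round-robin distribution. It is not how any particular shipper implements it; the instance names and the `round_robin` helper are made up for illustration.

```python
from itertools import cycle

# Hypothetical instance names, used only for illustration.
instances = ["Logstash1", "Logstash2", "Logstash3"]

def round_robin(events, targets):
    """Assign each event to exactly one target, rotating through the list."""
    rotation = cycle(targets)
    return [(event, next(rotation)) for event in events]

assignments = round_robin([4567, 4568, 4569, 4570], instances)
for event, target in assignments:
    print(f"Event {event} -> {target}")
```

Every event appears once in `assignments`, so nothing is duplicated on the queue; the fourth event simply wraps around to the first instance again.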

Now if one of the Logstash instances dies, there are two others that will take over the load. The data source will recognize that one server isn't responding and send the event to the next Logstash instance in its list. Unequal load isn't a problem now either, since every machine is handling a little bit of everything. Those 10,000 file input events are now divided roughly equally among three machines.
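The failover idea can be sketched the same way. This is only an illustration under simple assumptions: the `send_with_failover` helper, the `down` set, and the instance names are hypothetical, and a real client would detect a dead server through timeouts or connection errors rather than a lookup.

```python
def send_with_failover(event, targets, is_up, start):
    """Try targets beginning at index `start`, skipping any that are down.

    Returns the name of the target that accepted the event.
    """
    n = len(targets)
    for offset in range(n):
        target = targets[(start + offset) % n]
        if is_up(target):
            return target
    raise RuntimeError("no Logstash instance is reachable")

# Hypothetical setup: three instances, one of which has died.
targets = ["Logstash1", "Logstash2", "Logstash3"]
down = {"Logstash2"}

delivered = {}
for i in range(9):
    target = send_with_failover(
        i, targets, lambda t: t not in down, start=i % len(targets)
    )
    delivered[target] = delivered.get(target, 0) + 1

print(delivered)
```

No event is lost while `Logstash2` is down. Note that with this naive "try the next one" rule, the dead instance's share all lands on its neighbor, so the surviving machines are not loaded perfectly evenly; a proper load balancer usually handles that better.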

A problem you could come across with the second method is that your data source may not be able to round-robin through a list of Logstash servers because it can only send to one IP. If that's the case, you may need to add another step that load balances requests. HAProxy, for example, can do this.
You also have to have all three Logstash instances set up identically. Any inputs or filters need to exist on all machines.

thanks a lot...