Logstash: handling duplicates

Hi all,

I am trying to increase the number of logstash servers for redundancy and want to know if using the fingerprint filter would achieve it.

Does this basically mean sending the same stream of logs/messages via multiple logstash servers, so that ES will only update the existing document's contents after comparing the fingerprint?

Also, a rather dumb question: is the document_id kept in memory? I just want to understand the performance implications, as I didn't find much documentation on the internals.

filter {
  fingerprint {
    source => "message"
    target => "[@metadata][fingerprint]"
    method => "SHA1"
    key => "Log analytics"
    base64encode => true
  }
}
output {
  elasticsearch {
    hosts => "myes.com"
    document_id => "%{[@metadata][fingerprint]}"
  }
}

If you send the same messages through three logstash servers, all writing to the same index with the same document_id, then you more than triple the cost: three logstash servers triples the logstash compute, and on the elasticsearch side each duplicate write requires a re-index, or a search to confirm the incoming document does not already exist, which costs more than a plain insert.

Personally I would put a kafka cluster in front of logstash, so that if one logstash instance fails, its consumers are rebalanced onto the surviving instances. Other pub/sub products could probably balance the load across logstash consumers equally well.
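A minimal sketch of the logstash side of that, assuming a hypothetical topic name "syslog" and illustrative broker addresses (adjust both to your environment):

input {
  kafka {
    bootstrap_servers => "kafka1:9092,kafka2:9092"
    topics => ["syslog"]
    # All logstash instances sharing this group_id form one consumer
    # group: kafka delivers each message to exactly one member, and
    # rebalances partitions onto the survivors if an instance dies.
    group_id => "logstash-syslog"
  }
}

Note that kafka consumer groups give at-least-once delivery, so occasional duplicates around a rebalance are still possible; keeping the fingerprint-based document_id as a safety net remains reasonable.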

This is one of those trade-offs that everyone in engineering makes all the time. You didn't describe any of the reliability constraints or the cost constraints. These (and other) constraints are what we balance every day.

You need a message consumer that acks each message it consumes and provides data integrity (logstash can do this with persistent queues, plus dead letter queues for some pipelines), as well as a message producer that respects acks.
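For reference, the persistent queue and dead letter queue are enabled in logstash.yml; the values below are illustrative, not recommendations:

# logstash.yml
queue.type: persisted          # buffer events on disk so a restart does not lose them
queue.max_bytes: 4gb           # cap on disk space used by the queue
dead_letter_queue.enable: true # park events elasticsearch rejects instead of dropping them

The dead letter queue currently applies to events the elasticsearch output cannot deliver (e.g. mapping errors), and its contents can be replayed with the dead_letter_queue input plugin.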

You have not provided enough information for us to offer more specific advice.


Hi @badger,

If you send the same messages through three logstash servers, all writing to the same index with the same document_id, then you more than triple the cost: three logstash servers triples the logstash compute, and on the elasticsearch side each duplicate write requires a re-index, or a search to confirm the incoming document does not already exist, which costs more than a plain insert.

Thanks for the suggestion and insight. I am aware that adding logstash servers adds compute cost. (We thought about implementing Kafka, but that is another consideration, as our syslog servers are spread across 8 different datacenters.) The short-term goal is to provide redundancy for our syslog pipeline.

Do we actually want a PQ in this case, if there are multiple logstash servers pushing the messages? For example: logstash A reboots, but logstash B would already have written to the same index.