Logstash: handling duplicates

Hi all,

I am trying to increase the number of logstash servers for redundancy and want to know if using the fingerprint filter would achieve it.

Does this basically mean sending the same stream of logs/messages via multiple logstash servers, so that ES will only update the existing document's contents after comparing the fingerprint?

Also, a rather dumb question: is the document_id kept in memory? I just want to understand the performance implications, as I didn't find much documentation on the internals.

filter {
  fingerprint {
    source => "message"
    target => "[@metadata][fingerprint]"
    method => "SHA1"
    key => "Log analytics"
    base64encode => true
  }
}
output {
  elasticsearch {
    hosts => "myes.com"
    document_id => "%{[@metadata][fingerprint]}"
  }
}

If you send the same messages through three logstash servers, all writing to the same index with the same document_id, then you more than triple the cost: three logstash servers triples the logstash compute, and on the elasticsearch side each duplicate write requires a re-index, or a search to confirm the incoming document does not already exist, which costs more than a plain insert.

Personally I would put a kafka cluster in front of logstash, so that if one logstash instance fails, its consumers are rebalanced onto the surviving instances. Other pub/sub products could probably balance the load across logstash consumers equally well.
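A minimal sketch of the logstash side of that, assuming a hypothetical topic name "syslog" and illustrative broker addresses (adjust both to your environment):

input {
  kafka {
    bootstrap_servers => "kafka1:9092,kafka2:9092"
    topics => ["syslog"]
    # All logstash instances sharing this group_id form one consumer
    # group: kafka delivers each message to exactly one member, and
    # rebalances partitions onto the survivors if an instance dies.
    group_id => "logstash-syslog"
  }
}

Note that kafka consumer groups give at-least-once delivery, so occasional duplicates around a rebalance are still possible; keeping the fingerprint-based document_id as a safety net remains reasonable.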

This is one of those trade-offs that everyone in engineering makes all the time. You didn't describe any of the reliability constraints or the cost constraints. These (and other) constraints are what we balance every day.

You need a message consumer that acks each message it consumes and provides data integrity (logstash can do this with persistent queues, plus dead letter queues for some pipelines), as well as a message producer that respects acks.
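For reference, the persistent queue and dead letter queue are enabled in logstash.yml; the values below are illustrative, not recommendations:

# logstash.yml
queue.type: persisted          # buffer events on disk so a restart does not lose them
queue.max_bytes: 4gb           # cap on disk space used by the queue
dead_letter_queue.enable: true # park events elasticsearch rejects instead of dropping them

The dead letter queue currently applies to events the elasticsearch output cannot deliver (e.g. mapping errors), and its contents can be replayed with the dead_letter_queue input plugin.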

You have not provided enough information for us to offer more specific advice.


Hi @badger,

If you send the same messages through three logstash servers, all writing to the same index with the same document_id, then you more than triple the cost: three logstash servers triples the logstash compute, and on the elasticsearch side each duplicate write requires a re-index, or a search to confirm the incoming document does not already exist, which costs more than a plain insert.

Thanks for the suggestion and insight. I am aware that adding logstash servers adds compute cost. (We thought about implementing Kafka, but that is another consideration, as our syslog servers are spread across 8 different datacenters.) The short-term goal is to provide redundancy for our syslog pipeline.

Do we actually want a PQ in this case, if there are multiple logstash servers pushing the messages? For example: logstash A reboots, but logstash B would already have written to the same index.