De-duplicating with MURMUR3 vs SHA256

Hi, has anyone used the fingerprint plugin with MURMUR3? So far I'm finding it has quite a high collision rate. Even with just a few hundred thousand records I managed to get 20 collisions.

Testing with SHA256 over 2 million records, I've had no collisions so far. I'm OK with some collisions, but not as many as MURMUR3 produced. Just wondering if this article should be updated: https://www.elastic.co/blog/logstash-lessons-handling-duplicates

In both scenarios I'm using the message and the kafka offset as the fields to hash.

@javadevmtl

Tangent: Have you measured the performance difference of sha256 vs murmur3 in your use case?

Also, FYI, this issue seeking 128bit murmur3 support might also be worth following if murmur3 is important for you.

A SHA256 digest is quite long (256 bits). Whether that length is required will depend on the data volume. It might be worthwhile trying out MD5 (128 bits) or SHA1 (160 bits) as well.
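To make the length trade-off concrete, here is a rough Python sketch that hashes a message plus an offset with each digest and prints the resulting size. The `fingerprint` helper, the `|` separator and the sample values are made up for illustration; this is not how the Logstash plugin builds its input, just a way to compare digest lengths.

```python
import hashlib


def fingerprint(message: str, offset: int, algorithm: str = "sha256") -> str:
    """Hypothetical helper: hash the message and offset together, return hex."""
    # The "|" separator is an assumption; any unambiguous join would do.
    payload = f"{message}|{offset}".encode("utf-8")
    return hashlib.new(algorithm, payload).hexdigest()


for algorithm in ("md5", "sha1", "sha256"):
    digest = fingerprint("example log line", 42, algorithm)
    # Each hex character encodes 4 bits of the digest.
    print(algorithm, len(digest) * 4, "bits:", digest)
```

The shorter digests save a bit of space per document ID, and for a few million records a 128-bit or 160-bit digest should still make accidental collisions extremely unlikely.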

@Mike.Barretta lol that explains it... I thought the logstash murmur3 was the 128bit version.

As for performance, it seems the same to me. I eyeballed the Logstash graphs in Kibana and they look pretty close... I'm running 3 Logstash nodes ingesting off an 18-partition topic on Kafka.
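On the 32-bit vs 128-bit point: a back-of-the-envelope birthday estimate lines up with what was seen. The sketch below assumes the hash output is uniformly distributed, that every input is distinct, and that MURMUR3 here yields a 32-bit value. The 300,000 figure is just a stand-in for "a few hundred thousand", and `expected_collisions` is a made-up helper name.

```python
def expected_collisions(n_records: int, hash_bits: int) -> float:
    """Expected number of colliding pairs among n distinct inputs.

    There are n*(n-1)/2 pairs, and under a uniform hash each pair
    collides with probability 1 / 2**hash_bits.
    """
    pairs = n_records * (n_records - 1) / 2
    return pairs / 2 ** hash_bits


# Assumed record counts: 300,000 for the MURMUR3 run, 2 million for SHA256.
print("32-bit, 300k records:", expected_collisions(300_000, 32))
print("128-bit, 2M records: ", expected_collisions(2_000_000, 128))
print("256-bit, 2M records: ", expected_collisions(2_000_000, 256))
```

With 32 bits the estimate comes out at roughly ten colliding pairs for 300,000 records, so seeing around 20 with somewhat more records is about what you'd expect rather than a sign of a broken hash. At 128 or 256 bits the expected count is effectively zero for these volumes.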
