Hi,
I have a problem with duplicate documents, so I am using the Logstash method described here:
It seems to do the job, as after deduplication I get a smaller number of documents. But then I run into another problem: when I use SHA256 for hashing, the index takes about double the space of the original, and when I use MURMUR3 it takes a little less space, which is normal (fewer documents -> less space).
The mapping is identical, and the documents themselves look the same apart from a much longer "_id" with SHA256.
I cannot use MURMUR3 because I have indices with more documents than MURMUR3 hashing can generate unique IDs for.
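For reference, this is roughly the pipeline I am using (a minimal sketch of the linked approach; the source field, hosts and index name are placeholders, not my exact config):

```
filter {
  fingerprint {
    # Hash the field(s) that identify a duplicate; "message" is a placeholder
    source => "message"
    target => "[@metadata][fingerprint]"
    method => "SHA256"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "telegraf-firewallconnections-%{+YYYY.MM}"
    # Using the fingerprint as the document _id makes duplicates overwrite each other
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```

These are the resulting indices: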
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open telegraf-firewallconnections-2020.01 SIbOVfRCSfCokaEX1ILBZg 1 0 1237508 0 74.2mb 74.2mb
green open telegraf-firewallconnections-2020.01mur3 N6P5fiM-Sm6UfEQlOu2aVQ 1 1 1188297 436 119.4mb 59.8mb
green open telegraf-firewallconnections-2020.01sha256 3nMtJTKcSnCQZxFRGvcrzQ 1 1 1188470 242 305.4mb 152.7mb
SHA256 generates a very long key, so it will take up a lot of space. Have you tried a SHA1 hash with base64encode enabled? It is a good hash that is shorter than SHA256, and base64 encoding shrinks it a bit further.
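Something along these lines (a sketch; swap in whichever field(s) you fingerprint on):

```
filter {
  fingerprint {
    source => "message"            # placeholder; use your own source field(s)
    target => "[@metadata][fingerprint]"
    method => "SHA1"
    base64encode => true           # 20-byte SHA1 digest -> 28-character base64 _id instead of 40 hex characters
  }
}
```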
@Christian_Dahlqvist thanks for the tip. I have tried all the hashing methods and I can see that MD5 with base64encode enabled takes the least amount of space, but still more than the original: 94.8mb compared to 74.2mb.
I would expect it to take up more space, so that is not surprising. It is the price to pay for avoiding duplicates. Be sure that you forcemerge your indices down to 1 segment and index the same data into them for a fair comparison.
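For example (index names taken from your _cat output; forcemerge is expensive, so only run it on indices that are no longer being written to):

```
POST /telegraf-firewallconnections-2020.01/_forcemerge?max_num_segments=1
POST /telegraf-firewallconnections-2020.01mur3/_forcemerge?max_num_segments=1
POST /telegraf-firewallconnections-2020.01sha256/_forcemerge?max_num_segments=1
```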