Fingerprint: SHA256 takes a lot more space than MURMUR3

Hi,
I have a problem with duplicate documents, so I am using the Logstash method described here:

It seems to do the job, as after deduplication I get a smaller number of documents. But then I get another problem: when I use SHA256 for hashing, the index takes double the space of the original, whereas when I use MURMUR3 it takes a little less space, which is expected: fewer documents -> less space.
The mapping is identical, and the documents themselves look the same apart from a much longer "_id" with SHA256.
I cannot use MURMUR3 because I have indices with more documents than MURMUR3 hashing can generate unique IDs for.
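
To put a rough number on the MURMUR3 limit, here is a minimal sketch. It assumes that the fingerprint filter's MURMUR3 method produces a 32-bit hash and that hash values are uniformly distributed. Under those assumptions, the birthday approximation n(n-1)/(2*2^bits) estimates how many documents would end up sharing an ID with another document. The function name and the 160-bit comparison (the SHA1 digest size) are illustrative choices, not something taken from the setup above.

```python
def expected_collisions(num_docs: int, hash_bits: int) -> float:
    """Birthday approximation for the expected number of colliding pairs."""
    hash_space = 2 ** hash_bits
    return num_docs * (num_docs - 1) / (2 * hash_space)


# Document count of the SHA256 index reported in this post.
DOCS = 1_188_470

EST_32_BIT = expected_collisions(DOCS, 32)    # assumed size of a MURMUR3 fingerprint
EST_160_BIT = expected_collisions(DOCS, 160)  # SHA1 digest size, for comparison

print(f"32-bit hash:  ~{EST_32_BIT:.1f} expected collisions")
print(f"160-bit hash: ~{EST_160_BIT:.3g} expected collisions")
```

For the roughly 1.19 million documents in this index, the 32-bit estimate comes out at around 160 collisions, each of which would silently overwrite a distinct document. For a 160-bit hash the estimate is effectively zero.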

health status index                                      uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   telegraf-firewallconnections-2020.01       SIbOVfRCSfCokaEX1ILBZg   1   0    1237508            0     74.2mb         74.2mb
green  open   telegraf-firewallconnections-2020.01mur3   N6P5fiM-Sm6UfEQlOu2aVQ   1   1    1188297          436    119.4mb         59.8mb
green  open   telegraf-firewallconnections-2020.01sha256 3nMtJTKcSnCQZxFRGvcrzQ   1   1    1188470          242    305.4mb        152.7mb

So why does SHA256 take so much space?

Mapping:

{
    "telegraf-firewallconnections-2020.01sha256": {
        "aliases": {},
        "mappings": {
            "doc": {
                "properties": {
                    "@timestamp": {
                        "type": "date"
                    },
                    "@version": {
                        "type": "keyword",
                        "ignore_above": 512
                    },
                    "firewallconnections": {
                        "properties": {
                            "firewallmetric1": {
                                "type": "float",
                                "index": false
                            },
                            "firewallmetric2": {
                                "type": "float",
                                "index": false
                            },
                            "firewallmetric3": {
                                "type": "float",
                                "index": false
                            },
                            "firewallmetric4": {
                                "type": "float",
                                "index": false
                            },
                            "firewallmetric5": {
                                "type": "float",
                                "index": false
                            },
                            "firewallmetric6": {
                                "type": "float",
                                "index": false
                            }
                        }
                    },
                    "measurement_name": {
                        "type": "keyword"
                    },
                    "tag": {
                        "properties": {
                            "agent_host": {
                                "type": "keyword",
                                "ignore_above": 512
                            },
                            "hostname": {
                                "type": "keyword",
                                "ignore_above": 512
                            },
                            "index": {
                                "type": "keyword",
                                "ignore_above": 512
                            },
                            "measurement_tag": {
                                "type": "keyword",
                                "ignore_above": 512
                            },
                            "platform_tag": {
                                "type": "keyword",
                                "ignore_above": 512
                            }
                        }
                    }
                }
            }
        },
        "settings": {
            "index": {
                "codec": "best_compression",
                "number_of_shards": "1",
                "provided_name": "telegraf-firewallconnections-2020.01sha256",
                "creation_date": "1585503480392",
                "number_of_replicas": "1",
                "uuid": "3nMtJTKcSnCQZxFRGvcrzQ",
                "version": {
                    "created": "6080399"
                }
            }
        }
    }
}

Logstash config:

input {
  # Read every document from the original index
  elasticsearch {
    hosts => "https://myelastic:9200"
    password => "password"
    user => "username"
    index => "telegraf-firewallconnections-2020.01"
  }
}
filter {
  # Hash the selected fields into a deterministic ID kept in @metadata
  fingerprint {
    key => "1234ABCD"
    method => "SHA256"
    source => ["@timestamp", "firewallconnections", "tag"]
    target => "[@metadata][generated_id]"
    concatenate_sources => true
  }
  mutate {
    remove_field => ["@version"]
  }
}
output {
  # Use the hash as _id so duplicate documents overwrite each other
  elasticsearch {
    hosts => "https://myelastic:9200"
    password => "password"
    user => "username"
    index => "telegraf-firewallconnections-2020.01sha256"
    document_id => "%{[@metadata][generated_id]}"
  }
}

SHA256 generates a very long key, so it will take up a lot of space. Have you tried a SHA1 hash with base64encode enabled? SHA1 is a good hash that is shorter than SHA256, and base64 encoding shrinks the resulting ID a bit.
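
As a rough illustration of the ID lengths involved, the sketch below computes keyed digests in Python and measures their hex and base64 string lengths. It is not the Logstash plugin itself. The message is a made-up placeholder, the key is the placeholder from the config above, and using HMAC here is an assumption about how the filter treats `key`. The lengths depend only on the digest size, not on the input.

```python
import base64
import hmac

KEY = b"1234ABCD"                        # placeholder key from the config above
MESSAGE = b"example concatenated fields"  # hypothetical input, not real data


def id_lengths(key: bytes = KEY, message: bytes = MESSAGE) -> dict:
    """Return the string length of each digest in hex and in base64."""
    lengths = {}
    for name in ("md5", "sha1", "sha256"):
        digest = hmac.new(key, message, name).digest()
        lengths[name] = {
            "hex": len(digest.hex()),
            "base64": len(base64.b64encode(digest)),
        }
    return lengths


LENGTHS = id_lengths()
for name, sizes in LENGTHS.items():
    print(name, sizes)
```

A SHA256 digest is 64 characters in hex, while SHA1 in base64 is 28 characters and MD5 in base64 is 24, so the "_id" stored for every document gets considerably shorter.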

@Christian_Dahlqvist thanks for the tip. I have tried all the hashing methods, and I can see that MD5 with base64encode enabled takes the least space, but still more than the original: 94.8mb compared to 74.2mb.

I would expect it to take up more space, so that is not surprising. It is the price to pay for avoiding duplicates. For a fair comparison, be sure to index the same data into each index and force merge your indices down to 1 segment.
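
A minimal sketch of the force merge step, assuming the standard `_forcemerge` endpoint with `max_num_segments=1` and the host and index names used earlier in this thread. The helper name is hypothetical, and the request is only built here, not sent: authentication and TLS settings are left out and would need to be added for a real cluster.

```python
import urllib.request


def build_forcemerge_request(base_url: str, index: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that merges an index to one segment."""
    url = f"{base_url.rstrip('/')}/{index}/_forcemerge?max_num_segments=1"
    return urllib.request.Request(url, method="POST")


REQ = build_forcemerge_request(
    "https://myelastic:9200",
    "telegraf-firewallconnections-2020.01sha256",
)
print(REQ.get_method(), REQ.full_url)

# To actually run the merge (needs credentials and network access):
# urllib.request.urlopen(REQ)
```

Running the same request against each index before comparing store sizes removes differences caused by deleted documents and segment counts.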
