Hi,
I have a problem with duplicate documents, so I am using the Logstash method described here:
It seems to do the job, as after deduplication I get a smaller number of documents. But then I run into another problem: when I use SHA256 for hashing, the index takes about double the space of the original, and when I use MURMUR3 it takes a little less space, which is normal (fewer documents -> less space).
The mapping is identical, and the documents themselves look the same apart from a much longer "_id" with SHA256.
I cannot use MURMUR3 because I have indices with more documents than MURMUR3 hashing can generate unique IDs for.
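For reference, this is roughly the pipeline I am using (a minimal sketch of the linked approach; the source field, hosts and index name are placeholders, not my exact config):

```
filter {
  fingerprint {
    # Hash the field(s) that identify a duplicate; "message" is a placeholder
    source => "message"
    target => "[@metadata][fingerprint]"
    method => "SHA256"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "telegraf-firewallconnections-%{+YYYY.MM}"
    # Using the fingerprint as the document _id makes duplicates overwrite each other
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```

These are the resulting indices: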
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open telegraf-firewallconnections-2020.01 SIbOVfRCSfCokaEX1ILBZg 1 0 1237508 0 74.2mb 74.2mb
green open telegraf-firewallconnections-2020.01mur3 N6P5fiM-Sm6UfEQlOu2aVQ 1 1 1188297 436 119.4mb 59.8mb
green open telegraf-firewallconnections-2020.01sha256 3nMtJTKcSnCQZxFRGvcrzQ 1 1 1188470 242 305.4mb 152.7mb
SHA256 generates a very long key, so it will take up a lot of space. Have you tried a SHA1 hash with base64encode enabled? It is a good hash that is shorter than SHA256, and base64 encoding shrinks it a bit further.
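Something along these lines (a sketch; swap in whichever field(s) you fingerprint on):

```
filter {
  fingerprint {
    source => "message"            # placeholder; use your own source field(s)
    target => "[@metadata][fingerprint]"
    method => "SHA1"
    base64encode => true           # 20-byte SHA1 digest -> 28-character base64 _id instead of 40 hex characters
  }
}
```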
@Christian_Dahlqvist thanks for the tip. I have tried all the hashing methods and I can see that MD5 with base64encode enabled takes the least amount of space, but still more than the original: 94.8mb compared to 74.2mb.
I would expect it to take up more space, so that is not surprising. It is the price to pay for avoiding duplicates. Be sure that you forcemerge your indices down to 1 segment and index the same data into them for a fair comparison.
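For example (index names taken from your _cat output; forcemerge is expensive, so only run it on indices that are no longer being written to):

```
POST /telegraf-firewallconnections-2020.01/_forcemerge?max_num_segments=1
POST /telegraf-firewallconnections-2020.01mur3/_forcemerge?max_num_segments=1
POST /telegraf-firewallconnections-2020.01sha256/_forcemerge?max_num_segments=1
```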