FScrawler - Duplicate Files

Hello All,

I am a newbie to Elasticsearch.

I am interested in using FSCrawler to index all my files, and then using Elasticsearch to find the duplicates.

I'm thinking of using the MD5 checksum option to hash all the files: https://fscrawler.readthedocs.io/en/fscrawler-2.5/admin/fs/local-fs.html#file-checksum
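i.e. something like this in the job's settings file (the job name `test` and the `url` path are placeholders; as far as I can tell the checksum option needs FSCrawler 2.5+):

```json
{
  "name": "test",
  "fs": {
    "url": "/path/to/files",
    "checksum": "MD5"
  }
}
```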

Then use something like https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html to search for duplicates.
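Concretely, I imagine a query along these lines, using `min_doc_count: 2` so only hashes shared by more than one document come back (the field names `file.checksum` and `file.filename` are my guess at FSCrawler's mapping, and `test` is a placeholder index name):

```json
POST /test/_search
{
  "size": 0,
  "aggs": {
    "duplicates": {
      "terms": {
        "field": "file.checksum",
        "min_doc_count": 2,
        "size": 1000
      },
      "aggs": {
        "files": {
          "top_hits": {
            "_source": ["file.filename"],
            "size": 10
          }
        }
      }
    }
  }
}
```

The `top_hits` sub-aggregation is there so each duplicated hash also lists which files share it.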

Would that be a reasonable approach, or is there a better way?


Hey Maurice

It could be easier not to index duplicate files in the first place.

Would that be an option?
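Just as an illustration (a plain Python sketch, not part of FSCrawler; the helper names are mine), you could pre-filter a directory tree by content hash before pointing the crawler at it:

```python
import hashlib
from pathlib import Path
from typing import Iterator


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large files are not loaded fully into memory."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def unique_files(root: Path) -> Iterator[Path]:
    """Yield one path per distinct content hash, skipping later duplicates."""
    seen: set[str] = set()
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        digest = md5_of(path)
        if digest in seen:
            continue  # same content already seen under another name
        seen.add(digest)
        yield path
```

You would then copy or symlink only the yielded paths into the directory FSCrawler watches.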

Otherwise, I think the terms aggregation you linked is the right way to go.

Unfortunately, I do not control the source of the files being indexed. Thank you for the response.