FSCrawler - Duplicate Files

Hello All,

I am a newbie to Elasticsearch.

I am interested in using FSCrawler to index all my files, then using Elasticsearch to find duplicates.

I'm thinking of using the MD5 checksum feature to hash all the files: https://fscrawler.readthedocs.io/en/fscrawler-2.5/admin/fs/local-fs.html#file-checksum
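Based on the linked docs, enabling the checksum in the FSCrawler job settings might look something like this (the job name and path below are placeholders, not from the original post):

```yaml
name: "my_files_job"        # hypothetical job name
fs:
  url: "/path/to/files"    # placeholder: directory to crawl
  checksum: "MD5"          # compute an MD5 hash for each indexed file
```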

Then use a terms aggregation (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html) to search for duplicate checksums.
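As a rough sketch of that idea: a terms aggregation with `min_doc_count: 2` would return only checksums shared by more than one file. This assumes the checksum ends up in a `file.checksum` field and that `my_files_job` is the index name; check your actual mapping before using it.

```json
POST /my_files_job/_search
{
  "size": 0,
  "aggs": {
    "duplicate_hashes": {
      "terms": {
        "field": "file.checksum",
        "min_doc_count": 2,
        "size": 100
      },
      "aggs": {
        "matching_files": {
          "top_hits": {
            "_source": ["file.filename", "path.real"],
            "size": 10
          }
        }
      }
    }
  }
}
```

Each returned bucket groups the documents that share one checksum, so any bucket with `doc_count` of 2 or more is a set of likely duplicates.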

Or would something like this be a better approach, or is there a better way entirely?


Hey Maurice

It might be easier not to index duplicate files in the first place.

Would that be an option?
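If pre-filtering is an option, a minimal standalone sketch of the idea (plain Python, not part of FSCrawler; all names here are hypothetical) would be to hash each file and only pass the first occurrence of each checksum on to indexing:

```python
import hashlib
from pathlib import Path


def md5_of(path: Path) -> str:
    """Compute the MD5 checksum of a file, reading in chunks."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def unique_files(paths):
    """Yield only the first file seen for each checksum, skipping duplicates."""
    seen = set()
    for p in paths:
        digest = md5_of(p)
        if digest not in seen:
            seen.add(digest)
            yield p
```

The surviving files could then be fed to FSCrawler (or indexed directly), so duplicates never reach the index at all.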

Otherwise, I think the terms aggregation you linked is the right way to go.

Unfortunately I do not control the source of the files being indexed. Thank you for the response.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.