How to calculate similarity between files while only indexing their URL and metadata

Hi,
I'm using FSCrawler to index files into Elasticsearch (nice solution, BTW), but I don't want to index the content, so I only index the metadata and the URL. So far so good. My final goal is to calculate the similarity between these files based on their content, but without indexing it: the user will add a new file, and we will return files that are similar to it, each with a score.
Can I make the comparison by reading the files directly through the URL and comparing their content?


Nope, you'd need to do that in code external to Elasticsearch.

Can you give me some ideas, please? You mean I can indeed index the URL and metadata in ES, but the similarity calculation can't happen in ES and has to be done in external code? And in that external code, which libraries or projects could help me, like Tika or something else?

Depends on what you mean by similarity, I guess?

The user will add their file and I need to return files that are similar to it (simply similar in content, nothing complex), and every file will have a similarity score. I guess it's the same meaning of similarity as in ES.

Ok, so how will you calculate that?

I'm asking for ideas to help me. I can already do that in ES by indexing the content of the files: when I give it a certain file, it returns similar files with a score.
I'm asking how I can calculate similarity without indexing the content.
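
For reference, a minimal sketch of that indexed-content baseline, using a more_like_this query through the REST API. The index name, document id, and field are assumptions; FSCrawler's default mapping stores extracted text in a `content` field:

```python
import requests

ES = "http://localhost:9200"
INDEX = "documents"  # hypothetical FSCrawler index name

# more_like_this finds documents whose indexed text resembles the given
# document; this only works because `content` is indexed.
query = {
    "query": {
        "more_like_this": {
            "fields": ["content"],
            "like": [{"_index": INDEX, "_id": "the-new-file-id"}],
            "min_term_freq": 1,
            "min_doc_freq": 1,
        }
    }
}

resp = requests.get(f"{ES}/{INDEX}/_search", json=query).json()
for hit in resp["hits"]["hits"]:
    # each hit carries a relevance score, i.e. the similarity ranking
    print(hit["_score"], hit["_source"].get("file", {}).get("filename"))
```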

If I'm guessing right, he wants to calculate a similarity based on one or more of the metadata fields. Let's say filename and author. I can think of something like:

"CV of Jeroen.docx" and "Copy of cv of Jeroen.docx". Both with author "Jeroen".

It appears to be the content, not the metadata.

@rvanegmond No, not based on metadata but on content, which is not indexed in ES. I know it doesn't sound logical, but the reason I'm considering this is that I tried indexing the content in ES and calculating similarity, and it works, but I have many large files and it doesn't seem reasonable to index all of their content. I'm looking for a way to calculate similarity while indexing only the URL and metadata, perhaps by reading the files at calculation time.
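
For concreteness, here is one way that external calculation could look: pull each file through its stored URL, extract the text with Apache Tika (via the `tika` Python bindings), and rank candidates by TF-IDF cosine similarity. All names and URLs below are hypothetical, and note that re-reading every candidate file per query will be slow with many large files:

```python
import requests
from tika import parser  # Apache Tika bindings: pip install tika
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def text_from_url(url: str) -> str:
    """Download a file by URL and extract its text with Tika."""
    raw = requests.get(url).content
    parsed = parser.from_buffer(raw)  # starts a local Tika server on first call
    return parsed.get("content") or ""

# The URLs would come from the ES index (the only thing that was indexed)
candidate_urls = [
    "http://files.example.com/a.pdf",   # hypothetical
    "http://files.example.com/b.pdf",   # hypothetical
]
new_file_text = text_from_url("http://files.example.com/new.docx")

texts = [new_file_text] + [text_from_url(u) for u in candidate_urls]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)

# cosine similarity of the new file (row 0) against every candidate
scores = cosine_similarity(tfidf[0], tfidf[1:]).ravel()
for url, score in sorted(zip(candidate_urls, scores), key=lambda p: -p[1]):
    print(f"{score:.3f}  {url}")
```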

Thanks! :blush:

About the question: I wonder if you could use this ingest-anonymize plugin to compute a fingerprint at index time, which might help to identify similar documents?
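
As a rough sketch of the same idea, here is a pipeline using Elasticsearch's built-in fingerprint processor (available in recent versions) instead of the plugin; the pipeline name and field names are assumptions:

```python
import requests

ES = "http://localhost:9200"

# Ingest pipeline that hashes the extracted text into a `fingerprint`
# field; identical content yields identical fingerprints, so this spots
# exact duplicates rather than "roughly similar" files.
pipeline = {
    "processors": [
        {
            "fingerprint": {
                "fields": ["content"],
                "target_field": "fingerprint",
                "method": "SHA-1",
            }
        },
        # drop the bulky text once the fingerprint is computed, so only
        # the URL, metadata and fingerprint end up in the index
        {"remove": {"field": "content", "ignore_missing": True}},
    ]
}

requests.put(f"{ES}/_ingest/pipeline/file-fingerprint", json=pipeline)
```

FSCrawler could then reference that pipeline name in its elasticsearch settings, and the remove processor keeps the content out of the index. Note that an exact hash only flags identical content; near-duplicates would need something fuzzier.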

FSCrawler can apply an ingest pipeline as explained in Elasticsearch settings — FSCrawler 2.10-SNAPSHOT documentation

Maybe this could help? Note that this requires indexing the document.

Another solution could be to start FSCrawler as a REST Service and use the simulate endpoint.

Once you get back the JSON content (which is not indexed in that case), you can compute a fingerprint using your own code, or with an ingest pipeline again via the simulate ingest API.
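
A sketch of that flow, assuming FSCrawler's REST service on its default port; the endpoint path and the simulate flag follow my reading of the FSCrawler docs and may differ between versions, so please check the documentation for your release:

```python
import hashlib
import requests

FSCRAWLER = "http://127.0.0.1:8080/fscrawler"

# Send the file to FSCrawler's REST service in simulate mode: Tika
# extracts content + metadata and the JSON comes back WITHOUT being
# indexed into Elasticsearch. (The endpoint name varies across versions;
# the response shape below is an assumption, so inspect what yours returns.)
with open("new-file.docx", "rb") as f:
    resp = requests.post(
        f"{FSCRAWLER}/_upload",
        params={"simulate": "true"},
        files={"file": f},
    ).json()

doc = resp.get("doc", {})
content = doc.get("content", "")

# Compute a fingerprint in your own code from the extracted text
fingerprint = hashlib.sha1(content.encode("utf-8")).hexdigest()
print(fingerprint, doc.get("meta"))
```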

Only some thoughts here. Not sure it helps.


GREAT thoughts! I will explore them and try to solve my problem, and I'll get back to you with feedback or questions.

Hi,
unfortunately, I could not use what you suggested, because I didn't find a way to compute similarity from the fingerprints or hashes.
I think I need a simpler solution, so now I'm just trying to compute similarities between files (or, given a word, get the most relevant documents, like a search engine) using just the metadata of the documents, which I index in ES.
My question here: is there a way to make sure the metadata always holds good and sufficient information about the document?
What do you think about the carrot2 plugin, can it help? https://github.com/carrot2/elasticsearch-carrot2

I don't know it.

Do you mean "can I be guaranteed that FSCrawler / Tika will always extract metadata?" I guess it depends on your source documents, but unless there is a bug, I don't think Tika will extract less data.
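
If it helps for the metadata-only search, a minimal sketch of a relevance query over metadata fields; the index name is hypothetical, and the field names follow FSCrawler's default mapping (file.filename, meta.title, meta.author), so verify them against your index:

```python
import requests

ES = "http://localhost:9200"
INDEX = "documents"  # hypothetical index name

# Relevance search over metadata fields only; scores reflect how well
# the metadata (not the unindexed content) matches the query terms.
query = {
    "query": {
        "multi_match": {
            "query": "quarterly report",
            "fields": ["file.filename", "meta.title^2", "meta.author"],
        }
    }
}

resp = requests.get(f"{ES}/{INDEX}/_search", json=query).json()
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["file"]["url"])
```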

OK, thank you. I think it will work.
