How to calculate similarity between files while only indexing their URL and metadata

Hi,
I'm using FSCrawler to index files into Elasticsearch (nice solution, BTW), but I don't want to index the content, so I only index the metadata and the URL. So far so good. My final goal is to calculate the similarity between these files based on their content, but without indexing it: the user will add a new file, and we will return files that are similar to it, each with a score.
Can I make the comparison by reading the files directly through the URL and comparing their content?


Nope, you'd need to do that in code external to Elasticsearch.

Can you give me some ideas, please? You mean I can indeed index the URL and metadata in ES, but the similarity calculation can't happen in ES and has to be done in external code? And in that external code, which libraries or projects could help me, like Tika or something else?

Depends on what you mean by similarity, I guess?

The user will add their file and I need to return files that are similar to it (simply similar in content, nothing complex), and every file will have a similarity score. I guess it's the same meaning of similarity as in ES.

Ok, so how will you calculate that?

I'm asking for ideas to help me. I can already do that in ES by indexing the content of the files: when I give it a certain file, it returns similar files with a score.
I'm asking how I can calculate similarity without indexing the content.
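
For reference, a minimal sketch of that indexed-content baseline, using a more_like_this query through the REST API. The index name, document id, and field are assumptions; FSCrawler's default mapping stores extracted text in a `content` field:

```python
import requests

ES = "http://localhost:9200"
INDEX = "documents"  # hypothetical FSCrawler index name

# more_like_this finds documents whose indexed text resembles the given
# document; this only works because `content` is indexed.
query = {
    "query": {
        "more_like_this": {
            "fields": ["content"],
            "like": [{"_index": INDEX, "_id": "the-new-file-id"}],
            "min_term_freq": 1,
            "min_doc_freq": 1,
        }
    }
}

resp = requests.get(f"{ES}/{INDEX}/_search", json=query).json()
for hit in resp["hits"]["hits"]:
    # each hit carries a relevance score, i.e. the similarity ranking
    print(hit["_score"], hit["_source"].get("file", {}).get("filename"))
```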

If I'm guessing right, he wants to calculate a similarity based on one or more of the metadata fields. Let's say filename and author. I can think of something like:

"CV of Jeroen.docx" and "Copy of cv of Jeroen.docx". Both with author "Jeroen".

It appears to be the content, not the metadata.

@rvanegmond No, not based on metadata but on content, which is not indexed in ES. I know it doesn't sound logical, but the reason I'm considering this is that I tried indexing the content in ES and calculating similarity, and it works, but I have many large files and it doesn't seem reasonable to index all of their content. I'm looking for a way to calculate similarity while indexing only the URL and metadata, perhaps by reading the files at calculation time.
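
For concreteness, here is one way that external calculation could look: pull each file through its stored URL, extract the text with Apache Tika (via the `tika` Python bindings), and rank candidates by TF-IDF cosine similarity. All names and URLs below are hypothetical, and note that re-reading every candidate file per query will be slow with many large files:

```python
import requests
from tika import parser  # Apache Tika bindings: pip install tika
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def text_from_url(url: str) -> str:
    """Download a file by URL and extract its text with Tika."""
    raw = requests.get(url).content
    parsed = parser.from_buffer(raw)  # starts a local Tika server on first call
    return parsed.get("content") or ""

# The URLs would come from the ES index (the only thing that was indexed)
candidate_urls = [
    "http://files.example.com/a.pdf",   # hypothetical
    "http://files.example.com/b.pdf",   # hypothetical
]
new_file_text = text_from_url("http://files.example.com/new.docx")

texts = [new_file_text] + [text_from_url(u) for u in candidate_urls]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)

# cosine similarity of the new file (row 0) against every candidate
scores = cosine_similarity(tfidf[0], tfidf[1:]).ravel()
for url, score in sorted(zip(candidate_urls, scores), key=lambda p: -p[1]):
    print(f"{score:.3f}  {url}")
```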

Thanks! :blush:

About the question: I wonder if you could use this ingest-anonymize plugin to compute a fingerprint at index time, which might help to identify similar documents?
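
As a rough sketch of the same idea, here is a pipeline using Elasticsearch's built-in fingerprint processor (available in recent versions) instead of the plugin; the pipeline name and field names are assumptions:

```python
import requests

ES = "http://localhost:9200"

# Ingest pipeline that hashes the extracted text into a `fingerprint`
# field; identical content yields identical fingerprints, so this spots
# exact duplicates rather than "roughly similar" files.
pipeline = {
    "processors": [
        {
            "fingerprint": {
                "fields": ["content"],
                "target_field": "fingerprint",
                "method": "SHA-1",
            }
        },
        # drop the bulky text once the fingerprint is computed, so only
        # the URL, metadata and fingerprint end up in the index
        {"remove": {"field": "content", "ignore_missing": True}},
    ]
}

requests.put(f"{ES}/_ingest/pipeline/file-fingerprint", json=pipeline)
```

FSCrawler could then reference that pipeline name in its elasticsearch settings, and the remove processor keeps the content out of the index. Note that an exact hash only flags identical content; near-duplicates would need something fuzzier.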

FSCrawler can apply an ingest pipeline as explained in Elasticsearch settings — FSCrawler 2.10-SNAPSHOT documentation

Maybe this could help? Note that this requires indexing the document.

Another solution could be to start FSCrawler as a REST Service and use the simulate endpoint.

Once you get back the JSON content (which is not indexed in that case), you can compute a fingerprint using your own code, or with an ingest pipeline again via the simulate ingest API.
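
A sketch of that flow, assuming FSCrawler's REST service on its default port; the endpoint path and the simulate flag follow my reading of the FSCrawler docs and may differ between versions, so please check the documentation for your release:

```python
import hashlib
import requests

FSCRAWLER = "http://127.0.0.1:8080/fscrawler"

# Send the file to FSCrawler's REST service in simulate mode: Tika
# extracts content + metadata and the JSON comes back WITHOUT being
# indexed into Elasticsearch. (The endpoint name varies across versions;
# the response shape below is an assumption, so inspect what yours returns.)
with open("new-file.docx", "rb") as f:
    resp = requests.post(
        f"{FSCRAWLER}/_upload",
        params={"simulate": "true"},
        files={"file": f},
    ).json()

doc = resp.get("doc", {})
content = doc.get("content", "")

# Compute a fingerprint in your own code from the extracted text
fingerprint = hashlib.sha1(content.encode("utf-8")).hexdigest()
print(fingerprint, doc.get("meta"))
```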

Only some thoughts here. Not sure it helps.


GREAT thoughts! I will explore them and try to solve my problem, and I'll get back to you with feedback or questions.

Hi,
unfortunately, I could not use what you suggested, because I didn't find a way to compute similarity from the fingerprints or hashes.
I think I need a simpler solution, so now I'm just trying to compute similarities between files (or, given a word, get the most relevant documents, like a search engine) using just the metadata of the documents, which I index in ES.
My question here: is there a way to make sure the metadata always holds good and sufficient information about the document?
What do you think about the carrot2 plugin, can it help? https://github.com/carrot2/elasticsearch-carrot2

I don't know it.

Do you mean "can I be guaranteed that FSCrawler / Tika will always extract metadata?" I guess it depends on your source documents, but unless there is a bug, I don't think Tika will extract less data.
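
If it helps for the metadata-only search, a minimal sketch of a relevance query over metadata fields; the index name is hypothetical, and the field names follow FSCrawler's default mapping (file.filename, meta.title, meta.author), so verify them against your index:

```python
import requests

ES = "http://localhost:9200"
INDEX = "documents"  # hypothetical index name

# Relevance search over metadata fields only; scores reflect how well
# the metadata (not the unindexed content) matches the query terms.
query = {
    "query": {
        "multi_match": {
            "query": "quarterly report",
            "fields": ["file.filename", "meta.title^2", "meta.author"],
        }
    }
}

resp = requests.get(f"{ES}/{INDEX}/_search", json=query).json()
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["file"]["url"])
```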

OK, thank you. I think it will work.
