Precise similarity-based scoring?

Hi,

I'm working on a task to reproduce an algorithm (already working in
another infrastructure) that detects inter-document similarity across
a large collection of documents, hoping to benefit from ElasticSearch's
speed compared to our own sloppy index.

Each incoming document gets split into 4-grams of words (shingles),
throwing away all words shorter than 4 characters along the way. In our
own version of the algorithm, this produces patterns unique enough to
match one by one. The final score from document A to document B is the
number of shingles that match between A and B, divided by the total
number of shingles. The precision of the algorithm is good enough for
us at the moment.
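
For reference, a minimal sketch of the scoring in Python (the exact
tokenization and the choice of denominator here are simplifications of
what we actually do):

    def shingles(text, n=4, min_len=4):
        # Drop words shorter than min_len, then take overlapping
        # word n-grams (shingles).
        words = [w for w in text.lower().split() if len(w) >= min_len]
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def score(doc_a, doc_b):
        # Matching shingles divided by the total number of shingles
        # (A's shingle count in this sketch).
        a, b = shingles(doc_a), shingles(doc_b)
        return len(a & b) / len(a) if a else 0.0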

Is there any way to reproduce this scheme in ES?

We've tried doing so using:

  • The more_like_this query
  • Splitting document A into shingles, then building a query like

{
  "query": {
    "bool": {
      "should": [
        { "match": { "content": "shingle1" } },
        { "match": { "content": "shingle2" } }
      ]
    }
  }
}

with one "match" clause per shingle.
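
In code, we build that query from the shingle list roughly like this (a
sketch using the elasticsearch-py client; the index name "documents" and
the client setup are assumptions, the field name "content" is ours):

    from elasticsearch import Elasticsearch

    es = Elasticsearch()  # connection details omitted

    def shingle_query(shingle_list):
        # One "match" clause per shingle, combined under bool/should,
        # so each matching shingle adds to the document's score.
        return {
            "query": {
                "bool": {
                    "should": [{"match": {"content": s}} for s in shingle_list]
                }
            }
        }

    result = es.search(index="documents",
                       body=shingle_query(["shingle1", "shingle2"]))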

While the result we get is quite similar to the one our own algorithm
produces, there's no way to map the score onto an absolute scale (say,
from 1 to 100, with the score absolute with respect to all documents in
the set). The closest candidate to what we're looking for is to find
document A itself in the result list, take its match score as 100%, and
then recalculate all other scores relative to it.
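
That recalculation would look roughly like this (a sketch against the
standard ES response shape; doc_a_id is the ID document A was indexed
under):

    def normalized_scores(result, doc_a_id):
        # Take document A's own match score as 100% and rescale the rest.
        hits = result["hits"]["hits"]
        self_score = next(
            (h["_score"] for h in hits if h["_id"] == doc_a_id), None
        )
        if not self_score:
            return {}
        return {h["_id"]: 100.0 * h["_score"] / self_score for h in hits}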

However, the current similarity scheme is not really reverse-mappable
onto our scale. Which direction should we look in: hacking some scoring
parameters, or going straight to writing our own similarity plugin?
