Need help on similarity ranking approach

Hi,

I'm new to Elasticsearch and I want to use it for similarity ranking.

Requirement.

I need to index documents that have two free-text fields (field1 and field2). Whenever a new document arrives and is indexed, I need to find out how similar it is to the existing documents based on one field (say field1), and these similarities should be recorded. If the similarity reaches some X%, some action should be taken. These steps should be performed for every document that gets indexed.

My approach

  1. Whenever a new document arrives, index it first.
  2. Once it is indexed, match the document against the existing documents using field1.
  3. From the search results, read the score field as the similarity percentage and record it.
  4. Find the scores that reach X%, then perform the required action.
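
For reference, the four steps above can be sketched roughly as follows. This is only an outline: index_fn, similar_fn and on_match are hypothetical callables standing in for the indexing call, the similarity search and the "required action", and it assumes the search already returns a similarity value between 0 and 1.

```python
from typing import Callable, Dict, List, Tuple


def process_new_document(
    doc_id: str,
    doc: Dict[str, str],
    index_fn: Callable[[str, Dict[str, str]], None],
    similar_fn: Callable[[str, str], List[Tuple[str, float]]],
    threshold: float,
    on_match: Callable[[str, str, float], None],
) -> List[Tuple[str, float]]:
    """Index one document, compare it on field1, and act on strong matches.

    All three callables are hypothetical placeholders, not Elasticsearch APIs.
    """
    # Step 1: index the new document first.
    index_fn(doc_id, doc)
    # Step 2: match it against the existing documents using field1.
    similarities = similar_fn(doc_id, doc["field1"])
    # Step 3: record the similarities, skipping the document itself.
    captured = [(other, sim) for other, sim in similarities if other != doc_id]
    # Step 4: act on every similarity that reaches the threshold.
    for other, sim in captured:
        if sim >= threshold:
            on_match(doc_id, other, sim)
    return captured
```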

Could you please tell me whether this approach is fine, or whether there is a better way to do similarity ranking in such cases?

Thanks

Could you guys please help on this?

I'm not sure you can use the score to determine % similarity. You can
certainly run a more like this query against your index for each new
incoming document (and specify parameters such as percent_terms_to_match),
which may get you closer to what you want.

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html#query-dsl-mlt-query
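
As a rough illustration, a request body for such a query could be built like this. It is a sketch that assumes the parameter names from the 1.x-era documentation linked above (fields, like_text, percent_terms_to_match); later Elasticsearch versions renamed some of them, so check the docs for your version. The function name and the 0.3 default are only placeholders.

```python
import json
from typing import Any, Dict


def build_mlt_query(
    text: str,
    field: str = "field1",
    percent_terms_to_match: float = 0.3,
) -> Dict[str, Any]:
    """Build a more_like_this search body for free text (1.x-style names assumed)."""
    return {
        "query": {
            "more_like_this": {
                "fields": [field],
                "like_text": text,
                # Fraction of the selected terms that a hit must contain.
                "percent_terms_to_match": percent_terms_to_match,
            }
        }
    }


body = build_mlt_query("text of the newly indexed document")
print(json.dumps(body, indent=2))
```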

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/97e2f5bf-1c95-4775-a894-74650cccde12%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

The problem with your approach is that Lucene does not provide a score in
terms of how similar a document is to a query. The score is based on the
(default) TF-IDF algorithm and is not an absolute measure. You can score a
document against all others, and the scores will be comparable for that one
document, but the range of scores can vary greatly from one document to the next.

For example, the range of scores of one document against all others might
be 0.5 - 30. The range of scores for another document against the same
documents might be 1.2 - 24. It would be difficult to establish an overall
threshold. You can, of course, always find the top % of documents.
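
A minimal sketch of the "top %" idea, assuming the hits are available as (document id, score) pairs. The function name and the rounding-up choice are illustrative only.

```python
import math
from typing import List, Tuple


def top_fraction(hits: List[Tuple[str, float]], fraction: float) -> List[Tuple[str, float]]:
    """Return the best-scoring `fraction` of hits (at least one if any exist)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Rank by raw score; only the ordering is used, never the absolute value.
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    keep = max(1, math.ceil(len(ranked) * fraction)) if ranked else 0
    return ranked[:keep]
```

Because this depends only on the ordering within one result set, it sidesteps the problem that one document's scores span 0.5 - 30 and another's 1.2 - 24.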

The other issue is that the similarity will change as you index more
documents. If you only have one document in your index, the similarity
score for the next document should be different than if you indexed against an
index with millions of documents because of the IDF values.
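
To make the IDF effect concrete, here is a small sketch assuming the idf formula of Lucene's classic TF-IDF similarity, 1 + ln(numDocs / (docFreq + 1)). The document counts are made up for illustration.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """idf as assumed for Lucene's classic similarity: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# The same term, found in exactly one document, at two index sizes.
small_index = classic_idf(doc_freq=1, num_docs=1)
large_index = classic_idf(doc_freq=1, num_docs=1_000_000)
print(f"{small_index:.3f}")  # roughly 0.307
print(f"{large_index:.3f}")  # roughly 14.122
```

The term's weight grows by more than an order of magnitude purely because the index grew, so a fixed score threshold chosen early would not mean the same thing later.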

Even if your range of scores is comparable between documents, there is
nothing in Elasticsearch to help you with this task. The better question is
why do you need to calculate document relevancy between documents and not
simply rank documents according to a query?

--
Ivan

On Mon, Apr 28, 2014 at 12:34 AM, Rgs rakesh_gs@infosys.com wrote:

Could you guys please help on this?

--
View this message in context:
http://elasticsearch-users.115913.n3.nabble.com/Need-help-on-similarity-ranking-approach-tp4054847p4054889.html
Sent from the Elasticsearch Users mailing list archive at Nabble.com.

Thanks Binh Ly and Ivan Brusic for your replies.

I need to find the similarity, as a percentage, of a document against other documents; this will be used for grouping the documents.

Is it possible to get the similarity percentage using a more like this query? Or is there any other way to calculate the percentage of similarity from the query result?

E.g.: document1 is 90% similar to document2,
document1 is 45% similar to document3,
etc.

Thanks

Hello,

What you need is the score the document gets when it matches itself
using more like this. The API excludes the queried document. However, it is
equivalent to running a boolean query with one more like this field clause
for each field of the queried document. That query gives you, as the top
result, the document matching itself, so you can compute the percentage of
similarity of the remaining matched documents relative to that score.
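
A rough sketch of that normalization, assuming the hits come back as (document id, score) pairs and include the queried document itself. The function name is hypothetical, and nothing guarantees that every ratio stays at or below 100.

```python
from typing import Dict, List, Tuple


def similarity_percentages(
    hits: List[Tuple[str, float]], query_id: str
) -> Dict[str, float]:
    """Express each hit's score as a percentage of the self-match score."""
    scores = dict(hits)
    self_score = scores.get(query_id)
    if not self_score:
        raise ValueError("the queried document must be among the hits with a non-zero score")
    # Divide every other hit by the score the document got against itself.
    return {
        doc_id: round(100.0 * score / self_score, 1)
        for doc_id, score in hits
        if doc_id != query_id
    }


hits = [("document1", 8.0), ("document2", 7.2), ("document3", 3.6)]
print(similarity_percentages(hits, "document1"))
```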

Alex

Hi,

What I have done now is create a custom similarity class and a similarity provider class, which extend DefaultSimilarity and AbstractSimilarityProvider respectively, and override the idf() method to return 1.

Now I'm getting values like 1, 0.987, 0.876, etc., and I interpret them as 100%, 98%, 87%, etc.

Can you please confirm whether this approach can be used for finding the percentage of similarity?

Sorry for the late reply.

Thanks
Rgs

Hello,

I am not sure that would work. I'd first index your document, and then use
mlt with this document's id and include set to true (added in the latest ES
release). Then you'll know how "far" your documents are from the queried
document. Also, make sure to pick up most of the terms by setting
percent_terms_to_match=0, max_query_terms to a high value,
and min_doc_freq=1. To find out which terms from the queried document
have matched in the response, you can use explain.
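
Put together, the request body might look something like the sketch below. It assumes the 1.x-style more like this parameters; in particular the ids key for passing the document id is an assumption to verify against the docs for your release, and 1000 is just a placeholder for "a high value".

```python
import json
from typing import Any, Dict


def build_mlt_by_id_query(
    doc_id: str,
    field: str = "field1",
    max_query_terms: int = 1000,  # placeholder for "a high value"
) -> Dict[str, Any]:
    """Build an mlt search body that keeps the queried document in the results."""
    return {
        # Ask for a per-hit breakdown of which terms matched.
        "explain": True,
        "query": {
            "more_like_this": {
                "fields": [field],
                "ids": [doc_id],        # assumed key for "like this document"
                "include": True,        # keep the queried document as a hit
                "percent_terms_to_match": 0,
                "max_query_terms": max_query_terms,
                "min_doc_freq": 1,
            }
        },
    }


print(json.dumps(build_mlt_by_id_query("doc-42"), indent=2))
```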

Alex

Also, this plugin could provide a solution to your problem:

http://yannbrrd.github.io/
