I'm new to Elasticsearch and I want to use it for similarity ranking.
Requirement:
I need to index documents that have two free-text fields (field1 and field2). Whenever a new document arrives and is indexed, I need to find out how similar it is to the existing documents based on one field (say field1), and those similarities should be recorded. If the similarity reaches some threshold X%, some action should be taken. These steps should be performed for every document that gets indexed.
My approach:
Whenever a new document arrives, it is indexed first.
Once it is indexed, I match the document against the existing documents using field1.
From the search results, I read the score field as the similarity percentage and record it.
I find the scores that reach X% and then perform the required action.
Could you please tell me whether this approach is sound, or whether there is a better way to do similarity ranking in such cases?
I'm not sure you can use the score to determine % similarity. You can
certainly run a more like this query against your index for each new
incoming document (and specify parameters such as percent_terms_to_match)
to perhaps achieve something closer to what you want.
The problem with your approach is that Lucene does not provide a score in
terms of how similar a document is to a query. The score is based on the
(default) TF-IDF algorithm and is not an absolute measure. You can score a
document against all others, and the scores will be comparable for that one
document, but the overall score can vary greatly.
For example, the range of scores of one document against all others might
be 0.5 - 30. The range of scores for another document against the same
documents might be 1.2 - 24. It would be difficult to establish an overall
threshold. You can, of course, always find the top % of documents.
The other issue is that the similarity will change as you index more
documents. If you only have one document in your index, the similarity
score for the next document will be different than if you scored it against
an index with millions of documents, because of the IDF values.
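To make the IDF point concrete, here is a minimal sketch in Python. It assumes the classic Lucene TF-IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)); the function name and the index sizes are made up for illustration, and a real score also involves term frequency and normalization factors that are left out here.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """IDF under the classic Lucene TF-IDF formula (assumed here):
    1 + ln(num_docs / (doc_freq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# The same term, contained in exactly one document, at two index sizes.
# Both index sizes are hypothetical values chosen for illustration.
small_index_idf = classic_idf(doc_freq=1, num_docs=2)
large_index_idf = classic_idf(doc_freq=1, num_docs=1_000_000)

print(f"idf in a 2-document index:         {small_index_idf:.3f}")
print(f"idf in a 1,000,000-document index: {large_index_idf:.3f}")
```

The term's weight grows with the size of the index even though the two documents being compared have not changed, which is why a fixed score threshold does not carry over as the index grows.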
Even if your range of scores is comparable between documents, there is
nothing in Elasticsearch to help you with this task. The better question
is: why do you need to calculate relevancy between documents rather than
simply rank documents according to a query?
I need to find the similarity, as a percentage, of a document against other documents, and this will be used for grouping the documents.
Is it possible to get the similarity percentage using a more like this query? Or is there any other way to calculate the percentage of similarity from the query result?
E.g.: document1 is 90% similar to document2.
document1 is 45% similar to document3.
etc.
What you want to know is the score of the document matched against itself
using more like this. The API excludes the queried document. However, it is
equivalent to running a boolean query with one more like this field clause
for each field of the queried document. Running that query yourself gives
you, as the top result, the document matched against itself, so that you
can compute the percentage of similarity of the remaining matched documents
relative to that score.
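As a rough sketch of that computation (in Python, with made-up scores): each hit's score is divided by the score of the queried document matching itself. The hit dictionaries below only mimic the `_id` and `_score` fields of a search response; the function name and the sample values are hypothetical, and the resulting ratio is a relative measure, not an absolute similarity.

```python
from typing import Dict, List


def similarity_percentages(hits: List[dict], source_id: str) -> Dict[str, float]:
    """Express each hit's score as a percentage of the self-match score.

    `hits` is assumed to contain the queried document itself
    (identified by `source_id`) alongside the other matches.
    """
    self_score = next((h["_score"] for h in hits if h["_id"] == source_id), None)
    if not self_score:
        raise ValueError("queried document not found among the hits")
    return {
        h["_id"]: round(100.0 * h["_score"] / self_score, 1)
        for h in hits
        if h["_id"] != source_id
    }


# Hypothetical hits; the scores are invented for illustration only.
sample_hits = [
    {"_id": "document1", "_score": 8.0},  # the queried document itself
    {"_id": "document2", "_score": 7.2},
    {"_id": "document3", "_score": 3.6},
]

percentages = similarity_percentages(sample_hits, "document1")
print(percentages)
```

With these invented scores, document2 comes out at 90% and document3 at 45% of the self-match score.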
Alex
On Friday, May 2, 2014 3:22:34 PM UTC+2, Rgs wrote:
Thanks Binh Ly and Ivan Brusic for your replies.
What I did now is create a custom similarity class and a similarity provider class, which extend DefaultSimilarity and AbstractSimilarityProvider respectively, and override the idf() method to return 1.
Now I'm getting values like 1, 0.987, 0.876 etc., and I interpret them as 100%, 98%, 87% etc.
Can you please confirm whether this approach can be used for finding the percentage of similarity?
I am not sure that would work. I'd first index your document, and then use
mlt with this document id and include set to true (added in the latest ES
release). Then you'll know how "far" your documents are from the queried
document. Also, make sure to pick up most of the terms by
setting percent_terms_to_match=0, max_query_terms to a high value,
and min_doc_freq=1. To find out which terms from the queried document
have matched in the response, you can use explain.
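A sketch of what such a request body might look like, built as a plain Python dictionary. The names percent_terms_to_match, max_query_terms, min_doc_freq, include and explain come from this thread; the surrounding layout (`more_like_this`, `fields`, `ids`), the helper name and the value 1000 are assumptions, so check them against the documentation for the Elasticsearch version you are running.

```python
import json


def build_mlt_search(doc_id: str, field: str, max_query_terms: int = 1000) -> dict:
    """Build a search body that runs more like this for an indexed document.

    The key layout is an assumption based on the parameters named in
    this thread; it is not verified against a specific ES release.
    """
    return {
        "explain": True,  # ask for a per-hit scoring explanation
        "query": {
            "more_like_this": {
                "fields": [field],
                "ids": [doc_id],  # the already-indexed document to compare
                "include": True,  # keep the queried document in the results
                "percent_terms_to_match": 0,
                "max_query_terms": max_query_terms,  # "high value" (assumed)
                "min_doc_freq": 1,
            }
        },
    }


body = build_mlt_search(doc_id="1", field="field1")
print(json.dumps(body, indent=2))
```

Because include is set to true, the queried document should appear in the hits with its self-match score, which is the reference point for judging how "far" the other hits are.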
Alex
On Thursday, May 29, 2014 10:42:47 AM UTC+2, Rgs wrote:
hi,
What i did now is, i have created a custom similarity & similarity
provider
class which extends DefaultSimilarity and AbstractSimilarityProvider
classes
respectively and overridden the idf() method to return 1.
Now I'm getting some percentage values like 1, 0.987, 0.876 etc and
interpret it as 100%, 98%, 87% etc.
Can you please confirm whether this approach can be taken for finding the
percentage of similarity?