Term vectors for computing document similarity


(Aditya Rajgarhia) #1

Hello,

I'm trying to build some search features for a website. I don't have any
prior experience with search and decided to go with Elasticsearch mostly
because of the ease of use.

I am indexing two types of documents (each with its own index) since I
want to offer search functionality for either type of document. I have this
part working.

Now, I also want to offer a feature for comparing documents from one index
with those from the other. What I had in mind was that since ES uses
Lucene, I could fetch the term vectors for a pair of documents and then
compute the cosine similarity (as explained in
http://sujitpal.blogspot.in/2011/10/computing-document-similarity-using.html).
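
For readers unfamiliar with the measure, here is a minimal sketch of cosine similarity over two sparse term-frequency vectors. It is not taken from the linked post; the whitespace tokenizer and the sample sentences are illustrative assumptions standing in for a real analyzer and real documents.

```python
import math
from collections import Counter


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term -> weight mappings."""
    # Only terms present in both vectors contribute to the dot product.
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def term_frequencies(text):
    # Naive whitespace tokenizer (assumption); a real analyzer would also
    # handle punctuation, stemming and stop words.
    return Counter(text.lower().split())


doc_a = term_frequencies("the quick brown fox")
doc_b = term_frequencies("the lazy brown dog")
score = cosine_similarity(doc_a, doc_b)
print(score)
```

Raw term counts are used here for simplicity; weighting each term (for example by TF-IDF) before taking the cosine is a common refinement.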

However, from what I can tell ES doesn't expose the term vectors. Is it
still possible for me to use ES if I absolutely need the above feature? Is
it possible to read the Lucene index generated by ES directly without too
much trouble?

Of course, I could always generate the term vector dynamically for each
document for the purpose of implementing this particular feature, but
that's inefficient (I will be performing a large number of such
comparisons) and I don't want to do that if there is an alternative -- Solr
seems to allow fetching the term vectors :frowning:

Any help would be appreciated!

Thanks,
Aditya

--


(Loïc Bertron) #2

Hello,

You should have a look at these features, Fuzzy Like This and More Like
This:
http://www.elasticsearch.org/guide/reference/query-dsl/flt-query.html
http://www.elasticsearch.org/guide/reference/query-dsl/mlt-query.html

If you compare your text field to the text fields of all other
documents, you can restrict the results to documents that match 95% or more.
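
As a rough illustration of the suggestion above, the sketch below only builds a More Like This search body and prints it; nothing is sent to a cluster. The parameter names follow the current `more_like_this` documentation, and releases from the time of this thread used different names, so check the docs for your version. The field name `body` and the sample text are hypothetical.

```python
import json


def build_mlt_query(like_text, fields, min_match="95%"):
    """Build a more_like_this search body (no request is sent)."""
    return {
        "query": {
            "more_like_this": {
                "fields": fields,
                "like": like_text,
                # Lowered from the defaults so short texts still yield terms.
                "min_term_freq": 1,
                "min_doc_freq": 1,
                # Share of the selected query terms that a hit must contain.
                "minimum_should_match": min_match,
            }
        }
    }


# "body" is a hypothetical field name.
body = build_mlt_query("text of the source document", ["body"])
print(json.dumps(body, indent=2))
```

Note that `minimum_should_match` is a threshold on how many of the selected terms a hit must contain, which is not the same thing as a 95% similarity score.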

--


(Aditya Rajgarhia) #3

Loïc, thanks for the response.

I am familiar with mlt and am already using it to produce similar documents
from each of my indexes. However, for the particular feature that I
described in the last post, I want to explicitly compare several specific
documents from one index with a specific document from the second index and
get the score for each pair. In other words, I don't want to run a
comparison over every document in one or both indexes since there will be a
large number of documents (millions) in each index. My understanding is
that flt/mlt will do that, unfortunately.

If this is still achievable via flt/mlt, could you elaborate a bit on how?

Thanks,
Aditya

--


(Pratik Poddar) #4

Aditya, any luck here? I would appreciate it if you could share what you
learned. Thanks a ton.

Regards,
Pratik Poddar

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/662f4e65-5f78-4cb4-9759-c6976763fc02%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Aditya Rajgarhia) #5

For my purposes I was able to use mlt-field, which is slightly different
from mlt-query and offers you more customizability. Combined with and/or
queries, you can construct some really powerful queries.
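
The exact query was not shared in this thread, so the following is only one guess at the kind of combination described: a `bool` query that scores with `more_like_this` but filters the hits down to a handful of known document IDs. The index name, document IDs and field name are hypothetical, and the syntax follows the current query DSL rather than the 2014 one.

```python
def build_pairwise_mlt_query(source_index, source_id, candidate_ids, fields):
    """Score only the given candidates against one source document."""
    candidates = list(candidate_ids)
    return {
        "query": {
            "bool": {
                # Scoring clause: terms are taken from the source document.
                "must": {
                    "more_like_this": {
                        "fields": fields,
                        "like": [{"_index": source_index, "_id": source_id}],
                        "min_term_freq": 1,
                        "min_doc_freq": 1,
                    }
                },
                # Non-scoring clause: restrict hits to the chosen candidates.
                "filter": {"ids": {"values": candidates}},
            }
        },
        "size": len(candidates),
    }


# Hypothetical names: compare document "b7" from "index_b" against two
# specific documents; the body would be sent to the other index's _search.
query = build_pairwise_mlt_query("index_b", "b7", ["a1", "a2"], ["body"])
print(query)
```

Each returned hit would carry a `_score` for its pairing with the source document. That score is a Lucene relevance score rather than a cosine similarity, and candidates that share none of the selected terms would simply be missing from the hits.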

For what it's worth, I believe they've recently added a term vectors API as
well, which I didn't use since the above worked better and allowed me to
operate at a higher level.

You can find all of the above in their docs.



(Pratik Poddar) #6

Aditya,
Thanks for your reply. But even mlt_field gives you close documents. How do
we measure similarity between two documents? If you are able to solve this,
do you mind sharing the snippet please? Thanks a ton. Really appreciate it.

Regards,
Pratik

--
Pratik Poddar
www.linkedin.com/in/pratikpoddar




(Aditya Rajgarhia) #7

I didn't need to compute scores since chaining and nesting queries gave
me a much better solution for my needs than I would ever have been
able to get by writing the algorithm from scratch. Some of these query
types were not available when I posted this thread.

As I said, they've added term vectors recently:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-termvectors.html

Why can't you use this? Also, even before they added this API there was a
way to get term vectors by writing low-level code to get the Lucene
information from ES.
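
To make the term vectors route concrete, here is a hedged sketch that turns one term vectors response into a term-to-weight mapping. The response shape (`term_vectors`, `field_statistics`, `terms`, `term_freq`, `doc_freq`) follows the API documentation linked above, and `doc_freq` is only present when term statistics are requested. The sample response, the field name `body` and the particular IDF formula are assumptions made for illustration; the formula is one common variant, not what Lucene uses for scoring.

```python
import math


def weights_from_termvectors(response, field):
    """Turn one term vectors response into a term -> weight dict."""
    field_data = response["term_vectors"][field]
    doc_count = field_data.get("field_statistics", {}).get("doc_count")
    weights = {}
    for term, stats in field_data["terms"].items():
        tf = stats["term_freq"]
        df = stats.get("doc_freq")  # absent unless term statistics were requested
        if doc_count and df:
            # Smoothed TF-IDF (assumed variant): rarer terms weigh more.
            weights[term] = tf * (1.0 + math.log(doc_count / df))
        else:
            # Fall back to the raw term frequency.
            weights[term] = float(tf)
    return weights


# Hand-written sample in the documented response shape (made-up numbers).
sample = {
    "_id": "1",
    "term_vectors": {
        "body": {
            "field_statistics": {"sum_doc_freq": 6, "doc_count": 4, "sum_ttf": 8},
            "terms": {
                "fox": {"doc_freq": 1, "ttf": 2, "term_freq": 2},
                "the": {"doc_freq": 4, "ttf": 5, "term_freq": 1},
            },
        }
    },
}

weights = weights_from_termvectors(sample, "body")
print(weights)
```

Two such mappings, one per document, are the sparse vectors that a cosine similarity computation needs. Because the statistics are reported per index (and by default per shard), comparing documents from two different indexes with IDF weights mixes two sets of statistics, so raw term frequencies may be the safer input in that case.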


