I'm trying to build some search features for a website. I don't have any
prior experience with search and decided to go with Elasticsearch, mostly
because of its ease of use.
I am indexing two types of documents (each with its own index) since I
want to offer search functionality for either type of document. I have this
part working.
Now, I also want to offer a feature for comparing documents from one index
with those from the other. What I had in mind was that since ES uses
Lucene, I could fetch the term vectors for a pair of documents and then
compute the cosine similarity (as explained in "Salmon Run: Computing Document Similarity using Lucene Term Vectors", http://sujitpal.blogspot.in/2011/10/computing-document-similarity-using.html).
However, from what I can tell, ES doesn't expose the term vectors. Is it
still possible for me to use ES if I absolutely need the above feature? Is
it possible to read the Lucene index generated by ES directly without too
much trouble?
Of course, I could always generate the term vector dynamically for each
document for the purpose of implementing this particular feature, but
that's inefficient (I will be performing a large number of such
comparisons), and I don't want to do that if there is an alternative. Solr
seems to allow fetching the term vectors.
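For reference, a minimal sketch of the comparison described above, using the "generate the term vector dynamically" route. It assumes raw term frequencies and a naive lowercase word tokenizer rather than a real Lucene analyzer; the function names and sample sentences are illustrative only.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Build a raw term-frequency vector with a naive lowercase word tokenizer."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Only terms present in both vectors contribute to the dot product.
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm_a = math.sqrt(sum(f * f for f in vec_a.values()))
    norm_b = math.sqrt(sum(f * f for f in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


doc_a = term_vector("Lucene stores term vectors per field")
doc_b = term_vector("Elasticsearch is built on Lucene")
score = cosine_similarity(doc_a, doc_b)
```

The two sample documents share a single term, so the score is low but nonzero. Weighting the frequencies (for example with TF-IDF) before taking the cosine is a common refinement that this sketch leaves out.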
Loïc, thanks for the response.
I am familiar with mlt and am already using it to produce similar documents
from each of my indexes. However, for the particular feature that I
described in the last post, I want to explicitly compare several specific
documents from one index with a specific document from the second index and
get the score for each pair. In other words, I don't want to run a
comparison over every document in one or both indexes since there will be a
large number of documents (millions) in each index. My understanding is
that flt/mlt will do that, unfortunately.
If this is still achievable via flt/mlt, could you elaborate a bit on how?
Thanks,
Aditya
Aditya, any luck here? I would appreciate it if you could share what you
learned. Thanks a ton.
Regards,
Pratik Poddar
For my purposes I was able to use mlt-field, which is slightly different
from mlt-query and offers you more customizability. Combined with and/or
queries, you can construct some really powerful queries.
For what it's worth, I believe they've recently added a term vectors API as
well, which I didn't use since the above worked better and allowed me to
operate at a higher level.
You can search for all of the above on their docs.
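As an illustration of combining an mlt-style query with other clauses so that only a few chosen documents are scored, here is a hedged sketch of a search body. It uses the `more_like_this` query syntax from later ES versions, which is an assumption and not necessarily what was run in this thread; the index name, IDs, and field name are hypothetical.

```python
import json


def mlt_restricted_query(source_index, source_id, candidate_ids, fields):
    """Build a search body that scores only the given candidates against one source document."""
    return {
        "query": {
            "bool": {
                # more_like_this supplies the relevance score for each candidate.
                "must": {
                    "more_like_this": {
                        "fields": fields,
                        "like": [{"_index": source_index, "_id": source_id}],
                        "min_term_freq": 1,
                        "min_doc_freq": 1,
                    }
                },
                # The ids filter limits matching to the chosen documents
                # and does not contribute to the score.
                "filter": {"ids": {"values": candidate_ids}},
            }
        }
    }


body = mlt_restricted_query("articles", "42", ["7", "19", "23"], ["body"])
print(json.dumps(body, indent=2))
```

The body would be sent to the search endpoint of the index that holds the candidates. Candidates that share none of the selected terms with the source document simply do not match, so they are absent from the results rather than returned with a score of zero.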
Aditya,
Thanks for your reply. But even mlt_field only gives you similar documents.
How do we measure the similarity between two specific documents? If you were
able to solve this, would you mind sharing the snippet? Thanks a ton. I really
appreciate it.
I didn't need to compute scores since chaining and nesting queries allowed
me a much better solution for my needs than I would ever have been
able to get by writing the algorithm from scratch. Some of these query
types were not available when I posted this thread.
As I said, they've added term vectors recently:
Why can't you use this? Also, even before they added this API, there was a
way to get term vectors by writing low-level code to get the Lucene
information from ES.
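If the term vectors API is used, its response can be reduced to a plain term-to-frequency mapping that a similarity function can consume. The sketch below assumes the documented response layout (a `term_vectors` object keyed by field, each with a `terms` map carrying `term_freq`); the sample response is hand-written and its index, ID, field, and values are illustrative.

```python
def term_frequencies(response, field):
    """Extract {term: term_freq} for one field from a term vectors response."""
    field_data = response.get("term_vectors", {}).get(field, {})
    return {
        term: info["term_freq"]
        for term, info in field_data.get("terms", {}).items()
    }


# Hand-written sample in the shape the term vectors API returns.
sample_response = {
    "_index": "articles",
    "_id": "42",
    "found": True,
    "term_vectors": {
        "body": {
            "terms": {
                "lucene": {"term_freq": 2},
                "vectors": {"term_freq": 1},
            }
        }
    },
}

freqs = term_frequencies(sample_response, "body")
```

A field that was not returned yields an empty mapping, which keeps the calling code free of special cases.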