I have N documents containing attributes.
I needed to precompute a special similarity measure between each pair
of documents.
Now I would like to understand how to index and search using ES to answer
a query like
"Retrieve the top N documents that are the most similar to document
ID 1 and have fieldA = 1"
and facet the results according to a given field
--
I was thinking of creating an index of documents with all the associated
pairwise scores as attributes, like:
Issue with such a design:
The number of generated fields per document is very large in my case (10K),
and I am not sure how to search efficiently. I tried a script score like
return doc['sim_doc_id1'] + field1, but it was quite slow, especially
compared to a naive loop in Java. However, I would like to use the
aggregation framework of ES to create facets of the results.
Do you have any recommendations / guidelines for handling this problem?
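To make the design above concrete, here is a minimal sketch of it as plain Python dicts — one precomputed similarity field per other document, plus a script-score request like the one tried. All field names (sim_doc_1, fieldA, category) are illustrative placeholders, and the exact script syntax varies across ES versions:

```python
# Sketch of the "one similarity field per other document" design.
# Field names (sim_doc_1, fieldA, category) are illustrative placeholders.

# Each indexed document carries one precomputed similarity field per
# other document -- which is exactly what blows up to ~10K fields.
doc = {
    "id": 2,
    "fieldA": 1,
    "category": "xxx",
    "sim_doc_1": 0.83,   # similarity(this doc, doc 1)
    "sim_doc_3": 0.12,
    # ... ~10K more sim_doc_* fields
}

# A request body in the spirit of the attempt above: filter on fieldA,
# rank by the precomputed similarity to document 1, and facet on a field.
# (function_score / script syntax differs between ES versions.)
query = {
    "query": {
        "function_score": {
            "query": {"term": {"fieldA": 1}},
            "script_score": {"script": "doc['sim_doc_1'].value"},
        }
    },
    "aggs": {"by_category": {"terms": {"field": "category"}}},
    "size": 10,  # top N
}
```

The slowness the question describes is plausible with this layout: the script is evaluated per matching document, and the mapping itself grows with the corpus.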
Not Elasticsearch, but I'm doing a similar thing using Redis. I started out
with the recommendify Ruby gem and then wrote my own, commendo: http://rubygems.org/gems/commendo
We're using it in production for pairwise comparison of about 30,000 resources
at Meducation, for both visit-based similarity and content-based similarity. It
implements Jaccard now and could be extended.
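For reference, the Jaccard measure mentioned here can be sketched in a few lines — this is a generic illustration of the metric, not commendo's actual code, and the visit data is made up:

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Visit-based similarity: each resource is the set of users who visited it.
# (Hypothetical data for illustration.)
visits = {
    "res1": {"u1", "u2", "u3"},
    "res2": {"u2", "u3", "u4"},
}
score = jaccard(visits["res1"], visits["res2"])  # 2 shared / 4 total = 0.5
```

Precomputing this for every pair and storing only each resource's top-N neighbours is what keeps the stored data manageable.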
I've done something similar, but I use a single field for the related
items, like this:

id: 1
related: 2 3 9 100
category: xxx

id: 2
related: 1 9 88
category: uuu xxx

etc.

You can limit the related items to the top N if you sort them by score
when indexing and truncate the list. I don't model the relative scores
in Lucene, but you could do that in a crude way by repeating terms.
-Mike
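With that layout the original query collapses to a plain term match plus a filter and an aggregation, and repeating an ID in `related` raises its term frequency and hence its relevance score. A hedged sketch, with request bodies as plain dicts (field names are illustrative, and the filtered-query syntax matches ES of that era rather than current versions):

```python
# Query sketch for the single related-field design: documents whose
# "related" list contains doc 1, filtered on fieldA, faceted by category.
# Names are illustrative; pass the dict to any ES client of the period.
query = {
    "query": {
        "filtered": {
            "query": {"match": {"related": "1"}},
            "filter": {"term": {"fieldA": 1}},
        }
    },
    "aggs": {"by_category": {"terms": {"field": "category"}}},
    "size": 10,  # top N
}

# Crude score modelling by repeating terms, as suggested above: a related
# doc with similarity 0.9 is repeated ~9 times, one with 0.2 twice, so
# term frequency roughly tracks the precomputed similarity.
def related_field(scores, scale=10):
    terms = []
    for doc_id, s in sorted(scores.items(), key=lambda kv: -kv[1]):
        terms += [str(doc_id)] * max(1, round(s * scale))
    return " ".join(terms)
```

This keeps one field per document regardless of corpus size, sidestepping the 10K-field mapping, at the cost of quantising the similarity scores.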
On Friday, April 25, 2014 6:09:13 PM UTC-4, NM wrote:
I have N documents containing attributes.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.