Indexing a large N x N similarity matrix with ES


(nicolas) #1

I have N documents containing attributes.

I needed to precompute a special similarity measure between each pair
of documents.

Now I would like to understand how to index and search using ES to answer a
query like

"Retrieve the top N documents that are the most similar to document
ID 1 and have fieldA = 1"
and facet the results according to a given field.

--

I was thinking of creating an index of documents with all the associated
pairwise similarities as attributes, like:

Doc
id: 1
field1: 7
field2: 10
sim_doc_id2: 10
sim_doc_id3: 8
sim_doc_id4: 12
...
sim_doc_idN: 12

Doc
id: 2
field1: 5
field2: 2
sim_doc_id1: 10
sim_doc_id3: 3
sim_doc_id4: 2
...
sim_doc_idN: 10
..

Issue with this design:
The number of generated fields per document is very large in my case (10K),
and I am not sure how to search efficiently. I tried a script score (like
return doc['sim_doc_id1'] + field1), but it was quite slow, especially
compared to a naive loop in Java. However, I would like to use the
aggregation framework of ES to create facets of the results.
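
For reference, here is a minimal sketch of the kind of request described above, written as a Python helper that only builds the JSON body (nothing is sent to a cluster). The helper name `build_similar_docs_query` and its parameters are hypothetical, and the body assumes the `function_score` / `script_score` request syntax of recent Elasticsearch versions, which may differ from the version used in this thread.

```python
import json


def build_similar_docs_query(source_id, filter_field, filter_value, facet_field, size=10):
    """Build a search body that ranks documents by their precomputed
    similarity to `source_id` (hypothetical helper, not an ES API)."""
    # With the one-field-per-pair layout, every document stores its
    # similarity to document 1 in a field named sim_doc_id1, and so on.
    sim_field = f"sim_doc_id{source_id}"
    return {
        "size": size,
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        "filter": [
                            # Keep only documents matching the attribute filter.
                            {"term": {filter_field: filter_value}},
                            # Skip documents that have no similarity stored.
                            {"exists": {"field": sim_field}},
                        ]
                    }
                },
                # Use the stored similarity as the score.
                "script_score": {"script": {"source": f"doc['{sim_field}'].value"}},
                "boost_mode": "replace",
            }
        },
        # Facet the hits on the requested field.
        "aggs": {"facet": {"terms": {"field": facet_field}}},
    }


if __name__ == "__main__":
    body = build_similar_docs_query(1, "fieldA", 1, "field2")
    print(json.dumps(body, indent=2))
```

The `exists` filter is there because reading `.value` from a missing field can fail in a script; whether it is needed depends on the mapping and version.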

Do you have any recommendations or guidelines for handling this problem?

Thanks

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/4e5f22d6-4f0a-4739-92c8-8b2e85885a6f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Rob Styles) #2

Not Elasticsearch, but I'm doing a similar thing using Redis. I started out
with the recommendify Ruby gem and then wrote my own, commendo.
http://rubygems.org/gems/commendo

We're using it for production pairwise comparison of about 30,000 resources
at Meducation. Both visit-based similarity and content-based similarity. It
implements Jaccard now and could be extended.
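
For readers unfamiliar with it, Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The sketch below is a generic illustration in Python, not commendo's actual code; the function name `jaccard` and the convention of returning 0.0 for two empty sets are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention: two empty sets have similarity 0.0.
        return 0.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Two resources described by the (hypothetical) ids of their visitors.
    print(jaccard({1, 2, 3}, {2, 3, 4}))
```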

hth,

rob

On 25 April 2014 23:09, NM n.maisonneuve@gmail.com wrote:



(nicolas) #3

Thanks Rob, that might help.

But Redis doesn't have the aggregation/facet features of ES that I would
like to use.

Does anyone else have any insight into solving this issue with
Elasticsearch?

On Saturday, 26 April 2014 00:25:23 UTC+2, Rob Styles wrote:



(Michael Sokolov) #4

I've done something similar, but I use a single field for the related
items; like this:

id: 1
related: 2 3 9 100
category: xxx

id: 2
related: 1 9 88
category: uuu xxx

etc...

You can limit the related items to the top N if you sort them by score
when indexing and truncate the list. I don't model the relative scores
in Lucene, but you could do that in a crude way by repeating terms.
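
A rough sketch of how such a `related` value could be built at indexing time, in Python. The function name `build_related_field` and the `top_n` / `max_repeats` parameters are hypothetical; the repetition rule (scale each score against the best one) is just one possible way to approximate relative scores through term frequency.

```python
def build_related_field(similarities, top_n=100, max_repeats=1):
    """Turn {doc_id: score} into a whitespace-separated `related` value.

    Ids are sorted by score (highest first) and truncated to `top_n`.
    With max_repeats > 1, each id is repeated in proportion to its score
    relative to the best score, so stronger neighbours occur more often.
    """
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    if not ranked:
        return ""
    best = ranked[0][1]
    tokens = []
    for doc_id, score in ranked:
        repeats = 1
        if max_repeats > 1 and best > 0:
            # Every kept id appears at least once.
            repeats = max(1, round(max_repeats * score / best))
        tokens.extend([str(doc_id)] * repeats)
    return " ".join(tokens)


if __name__ == "__main__":
    # Similarities of document 1 from the original post (ids 2, 3 and 4).
    sims_doc1 = {2: 10, 3: 8, 4: 12}
    print(build_related_field(sims_doc1, top_n=2))  # prints "4 2"
```

Finding the documents related to document 1 then becomes a plain term match on `related`, which can be combined with a filter and aggregations on `category`.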

-Mike

On Friday, April 25, 2014 6:09:13 PM UTC-4, NM wrote:


