More like this scoring algorithm unclear


(Maarten Roosendaal) #1

Hi,

I have a question about why the 'more like this' algorithm scores some
documents higher than others when they appear (at first glance) to be the same.

What I've done is index wishlist documents that contain one property,
product_id, which holds an array of product IDs (e.g. [1234, 4444, 5555,
6666]). What I'm trying to do is find similar wishlists for a given wishlist
with id x. The MLT API seems to work: it returns other documents that
contain at least one of the product IDs from the original list.

But what I see is this: I get, for example, 10 hits, and the first 6 hits
each contain the same (and only one) product_id, which is also present in the
original wishlist. I would expect the first 6 to have the same score.
However, only the first 2 share a score; the next 2 score lower and the
2 after that lower still. Why is this?

Also, I'm trying to rewrite the MLT API call as an MLT query, but somehow it
doesn't work. I would expect that I need to take the entire content of the
original product_id property and feed it in as the 'like_text'. The
documentation is not very clear and doesn't provide examples, so I'm a
little lost.

Hope someone can give some pointers.

Thanks,
Maarten

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/0e2827b2-5a21-4cff-b773-ebdd861c5972%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Justin Treher) #2

Hey Maarten,

I would use the "explain": true option to see exactly why some documents are
being scored higher than others. As far as I know, MoreLikeThis uses the same
full-text scoring as regular queries, so term position would affect the score.
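
One possible explanation, offered as a sketch rather than a diagnosis: if product_id is indexed as an analyzed string with norms enabled, and the index uses Lucene's classic TF-IDF similarity, each hit's score is scaled by a field-length norm of 1/sqrt(number of terms in the field). Two wishlists that match on the same single product_id would then score differently when they hold different numbers of products. The function names and sample numbers below are hypothetical. Lucene also stores the norm with reduced precision and computes idf per shard; neither effect is modelled here.

```python
import math


def tf(freq):
    # Classic TF-IDF term-frequency factor: sqrt of occurrences in the field.
    return math.sqrt(freq)


def idf(num_docs, doc_freq):
    # Classic TF-IDF inverse document frequency.
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_terms):
    # Length normalisation: longer fields get a smaller norm.
    return 1.0 / math.sqrt(num_terms)


def term_score(freq, num_docs, doc_freq, num_terms):
    # Factors that are constant across hits (queryNorm, coord) are left out,
    # so only the relative ordering of these values is meaningful.
    return tf(freq) * idf(num_docs, doc_freq) ** 2 * field_norm(num_terms)


if __name__ == "__main__":
    # Hypothetical index: 1000 wishlists, 6 of them contain the matching id.
    for wishlist_length in (1, 4, 9):
        print(wishlist_length, term_score(1, 1000, 6, wishlist_length))
```

Under these assumptions a wishlist with four products scores half as high as a wishlist containing only the matching product, which is the kind of stepwise drop the "explain" output would show in its fieldNorm line.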

http://lucene.apache.org/core/3_0_3/api/contrib-queries/org/apache/lucene/search/similar/MoreLikeThis.html

Justin



(Maarten Roosendaal) #3

Hi,

Thanks, I'm not quite sure how to do that. I'm using:
http://localhost:9200/lists/list/[id of list]/_mlt?mlt_field=product_id&min_term_freq=1&min_doc_freq=1

The request body does not seem to be respected (I'm using the elasticsearch
head plugin) if I add:
{
  "explain": true
}

I've been trying to rewrite the MLT API call as an MLT query, but no luck so
far. Any suggestions?

Thanks,
Maarten



(Maarten Roosendaal) #4

The scoring algorithm is still vague to me, but I got the query to act like
the API. The results are different, though, so I'm still doing something
wrong. Here's an example:
{
  "explain": true,
  "query": {
    "more_like_this": {
      "fields": [
        "PRODUCT_ID"
      ],
      "like_text": "1000004004855475 1001004002067765 1002004000094210 1002004004499883",
      "min_term_freq": 1,
      "min_doc_freq": 1,
      "max_query_terms": 1,
      "percent_terms_to_match": 0.5
    }
  },
  "from": 0,
  "size": 50,
  "sort": [],
  "facets": {}
}

The like_text contains the product IDs from the wishlist for which I want to
find similar lists.
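
A minimal sketch of building that request body from a wishlist's product IDs, in case it helps with experimenting. The helper name build_mlt_query and its keyword arguments are hypothetical; the query keys are the ones used in the example above. One thing worth checking: max_query_terms caps how many terms are selected from like_text (the documented default is 25), so a value of 1 reduces the query to a single term, which may be one reason the results differ from the API.

```python
import json


def build_mlt_query(product_ids, field="PRODUCT_ID", max_query_terms=25, size=50):
    """Build a more_like_this search body from a list of product IDs.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    return {
        "explain": True,
        "query": {
            "more_like_this": {
                "fields": [field],
                # like_text is free text, so the IDs are joined with spaces.
                "like_text": " ".join(str(pid) for pid in product_ids),
                "min_term_freq": 1,
                "min_doc_freq": 1,
                # Upper bound on the number of terms picked from like_text.
                "max_query_terms": max_query_terms,
                "percent_terms_to_match": 0.5,
            }
        },
        "from": 0,
        "size": size,
    }


if __name__ == "__main__":
    wishlist = [1000004004855475, 1001004002067765,
                1002004000094210, 1002004004499883]
    # The printed JSON can be sent as the body of a _search request.
    print(json.dumps(build_mlt_query(wishlist), indent=2))
```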



(Alex Ksikes) #5

Hi Maarten,

Your 'like_text' is analyzed the same way your 'product_id' field is
analyzed, unless you specify a different 'analyzer'. I would recommend setting
'percent_terms_to_match' to 0. However, if you are only searching over
product IDs, then a simple boolean query would do. If not, I would
create a boolean query where each clause is a 'more like this field' query,
one for each field of the queried document. This is actually what the MLT
API does.
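
The simple boolean query suggested above could look roughly like the following sketch: one term clause per product ID, with at least one clause required to match. The helper name, the exclude_id parameter, and the sample wishlist ID are hypothetical. It also assumes each product ID is indexed as a single term in product_id, since term clauses are not analyzed.

```python
import json


def build_wishlist_bool_query(product_ids, field="product_id",
                              exclude_id=None, size=50):
    """Build a bool query matching wishlists that share any product ID.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    bool_query = {
        # One optional clause per product ID; more matches score higher.
        "should": [{"term": {field: pid}} for pid in product_ids],
        # Require at least one shared product.
        "minimum_should_match": 1,
    }
    if exclude_id is not None:
        # Keep the source wishlist itself out of the results.
        bool_query["must_not"] = [{"ids": {"values": [exclude_id]}}]
    return {"explain": True, "query": {"bool": bool_query}, "size": size}


if __name__ == "__main__":
    body = build_wishlist_bool_query([1234, 4444, 5555, 6666],
                                     exclude_id="wishlist-x")
    print(json.dumps(body, indent=2))
```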

Cheers,

Alex


