MoreLikeThis query, what does percent_terms_to_match do?

I'm trying out morelikethis
(http://www.elasticsearch.org/guide/reference/query-dsl/mlt-query.html) and
it's working well. So easy :)

I'm finding entries by related tags, so I dropped min_doc_freq to 1 (one
or more tags required for a match) and max_query_terms to 100 (an entry
could be tagged with up to 100 tags). However, from the docs it's not
clear to me what percent_terms_to_match does:

The percentage of terms to match on (float value). Defaults to 0.3 (30
percent).

Could someone explain this in other words please, perhaps an example of
what might happen if I increase or decrease from the default? When I try it
on my sample data it increases/reduces the number of hits and doesn't seem
to affect the score of each hit, so I'm just not sure what it's doing.

Thanks.

Effectively, the more like this query builds a big boolean query with a
should clause for each term. percent_terms_to_match means that, out of all
the terms built, at least X percent must match (effectively, it sets the
minimum_should_match parameter on that boolean query).
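To make that concrete, here is a rough Python sketch of the kind of boolean query described above. It is only an illustration, not the actual Elasticsearch code: the helper name mlt_as_bool_query and the "tags" field are made up for the example, and truncating the percentage to a whole number of clauses is an assumption about how the rounding works.

```python
import json


def mlt_as_bool_query(field, terms, percent_terms_to_match=0.3):
    """Build a bool query roughly equivalent to what more like this produces.

    Hypothetical helper for illustration only. Each selected term becomes
    one "should" clause, and the percentage becomes a clause count.
    """
    should_clauses = [{"term": {field: term}} for term in terms]
    # Assumption: the percentage is converted to a clause count by truncation.
    minimum = int(len(should_clauses) * percent_terms_to_match)
    return {
        "bool": {
            "should": should_clauses,
            "minimum_should_match": minimum,
        }
    }


# Four tags at 50 percent: at least two of the four clauses have to match.
query = mlt_as_bool_query("tags", ["search", "lucene", "java", "nosql"], 0.5)
print(json.dumps(query, indent=2))
```

With four terms and 0.5, the sketch asks for two matching clauses; lowering the percentage lowers that count, so more documents qualify.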

(In reply to Nick Dunn's post of Mon, May 21, 2012, 6:30 PM.)

Ah that makes sense, thanks Shay.

So increasing percent_terms_to_match to 0.5 means that if I have 20 tags, at least 50% (10) of them must match in the other document for it to be returned. Increasing the value increases precision but reduces recall, while decreasing it reduces precision but increases recall.

Cheers.
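
The 20-tag example can be checked with a small simulation. This is a sketch under stated assumptions, not how Elasticsearch evaluates the query: the documents and tag names are invented, the percentage is assumed to be truncated to a whole number of tags, and at least one shared tag is always required because a query made only of should clauses needs at least one of them to match.

```python
def tags_required(num_tags, percent_terms_to_match):
    """Number of shared tags a document needs (assumes truncation)."""
    # At least one should clause has to match for a document to be a hit.
    return max(1, int(num_tags * percent_terms_to_match))


def matching_docs(source_tags, docs, percent_terms_to_match=0.3):
    """Return ids of docs sharing enough tags with the source entry."""
    required = tags_required(len(source_tags), percent_terms_to_match)
    return [
        doc_id
        for doc_id, tags in docs.items()
        if len(source_tags & tags) >= required
    ]


# Invented sample data: a source entry with 20 tags and four candidates.
source = {f"tag{i}" for i in range(20)}
docs = {
    "shares_10": {f"tag{i}" for i in range(10)},
    "shares_9": {f"tag{i}" for i in range(9)},
    "shares_6": {f"tag{i}" for i in range(6)},
    "shares_2": {f"tag{i}" for i in range(2)},
}

print(matching_docs(source, docs, 0.5))  # needs 10 shared tags
print(matching_docs(source, docs, 0.3))  # needs 6 shared tags
```

At 0.5 only the document sharing 10 tags qualifies, while at 0.3 the documents sharing 10, 9 and 6 tags all do. The threshold only decides whether a document is included, which fits the observation that changing it alters the number of hits rather than the score of each hit.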

(In reply to Shay Banon's message of 23 May 2012, 23:30.)

Yep.

(In reply to Nick Dunn's message of Thu, May 24, 2012, 10:31 AM.)