I'm finding entries by related tags, so I dropped min_doc_freq to 1 (one
or more tags required for a match) and set max_query_terms to 100 (an
entry could be tagged with up to 100 tags). However, from the docs it's
not clear to me what percent_terms_to_match does:
The percentage of terms to match on (float value). Defaults to 0.3 (30
percent).
Could someone explain this in other words please, perhaps with an example
of what might happen if I increase or decrease it from the default? When I
try it on my sample data it increases or reduces the number of hits but
doesn't seem to affect the score of each hit, so I'm just not sure what
it's doing.
Effectively, what happens in the more like this query is that it builds a
big boolean query with a should clause for each term. The
percent_terms_to_match setting means that, out of all the terms built, at
least that percentage must match (effectively, it sets the
minimum_should_match parameter on the boolean query).
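To make that concrete, here is a minimal Python sketch of the kind of boolean query described above. It is only an illustration, not the actual Elasticsearch implementation: the helper name mlt_as_bool_query and the "tags" field are hypothetical, and truncating the fraction to a whole number of clauses is an assumption about how the percentage is rounded.

```python
import json


def mlt_as_bool_query(field, terms, percent_terms_to_match=0.3):
    """Approximate the boolean query that more_like_this is described as building."""
    # One "should" clause per selected term.
    should = [{"term": {field: term}} for term in terms]
    # Assumption: the fraction is truncated to a whole number of clauses.
    min_match = int(len(should) * percent_terms_to_match)
    return {"bool": {"should": should, "minimum_should_match": min_match}}


# Hypothetical entry with 20 tags, asking for half of them to match.
tags = ["tag%d" % i for i in range(20)]
query = mlt_as_bool_query("tags", tags, percent_terms_to_match=0.5)
print(json.dumps(query["bool"]["minimum_should_match"]))
```

The setting only decides how many of the should clauses a document has to satisfy before it counts as a hit; it does not add or remove clauses.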
So increasing percent_terms_to_match to 0.5 means that if I have 20 tags, 50% (10) of them must match in the other document for it to be returned. Increasing the value increases precision, while decreasing it decreases precision but increases recall.
Cheers.
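The precision/recall trade-off above can be checked on toy data. This is a rough sketch in plain Python rather than a real Elasticsearch query: the related function, the document ids and the tag names are all made up for illustration, and truncating the required count is again an assumption.

```python
def related(source_tags, docs, percent_terms_to_match):
    """Return ids of docs sharing enough tags with the source entry."""
    source = set(source_tags)
    # Assumption: the required number of matching tags is truncated.
    needed = int(len(source) * percent_terms_to_match)
    # A query made only of should clauses still needs at least one match.
    needed = max(needed, 1)
    hits = []
    for doc_id, doc_tags in docs.items():
        shared = len(source & set(doc_tags))
        if shared >= needed:
            hits.append(doc_id)
    return sorted(hits)


# Hypothetical sample data: four tagged entries.
docs = {
    "a": ["x", "y", "z", "w"],
    "b": ["x", "y"],
    "c": ["x"],
    "d": ["q"],
}
source_tags = ["x", "y", "z", "w"]

for pct in (0.25, 0.5, 1.0):
    print(pct, related(source_tags, docs, pct))
```

Raising the percentage only shrinks the set of hits, which matches the observation that the number of hits changes while the scores of the remaining hits do not.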
On 23 May 2012, at 23:30, Shay Banon wrote:
Effectively, what happens in the more like this query is that it builds a big boolean query with should clauses for each term. The percent terms to match means that out of all the terms built, at least X percent should match (effectively, setting the minimum_should_match parameter on it).