There are many other documents with a description like "Alarm curl
plugin for uWSGI" so I had expected that at least the "Alarm" is a term
that makes it "more-like-that"-style.
I'd welcome a hint as to what is going wrong here. Thanks.
Kindly
…Christoph
--
A distributed system is one in which I cannot get something done
because a machine I've never heard of is down. (Leslie Lamport)
Note that the default values are designed for a large corpus, not a test
example. In particular, you are butting your head against one or more of
(and this is a guess, but I do have a working implementation of MLT):
percent_terms_to_match
min_term_freq
min_doc_freq
I'm pretty sure you need to set these explicitly rather than let them default.
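For anyone following along, here is a minimal sketch of what setting those three parameters explicitly could look like. It assumes an Elasticsearch-style more_like_this query sent as a JSON body; the `like_text` key, the `description` field name, the body layout and the helper `build_mlt_query` are assumptions for illustration, not taken from the original poster's setup. Newer Elasticsearch releases use `like` and `minimum_should_match` instead of `like_text` and `percent_terms_to_match`, so check the names against your version.

```python
import json


def build_mlt_query(text, fields, min_term_freq=1, min_doc_freq=1,
                    percent_terms_to_match=0.3):
    """Build a more_like_this query body as a plain dict (hypothetical helper).

    The thresholds are passed explicitly so the large-corpus defaults
    do not silently filter out every term of a short field.
    """
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),          # assumed field name(s)
                "like_text": text,               # assumed key (older API style)
                "min_term_freq": min_term_freq,
                "min_doc_freq": min_doc_freq,
                "percent_terms_to_match": percent_terms_to_match,
            }
        }
    }


body = build_mlt_query("Alarm curl plugin for uWSGI", ["description"])
print(json.dumps(body, indent=2))
```

The resulting dict can be serialized and sent with whatever client you already use; the values 1, 1 and 0.3 are only starting points for a small test index.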
Thank you - that was it. I set min_term_freq=1 and received good
results. My fields are rather short (like "Alarm Clock for GTK
Environments"). So I understand that the default is min_term_freq=2,
which means that the term "Alarm" would have to occur at least twice in
the input text before it would make another document "related".
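The effect on short fields can be illustrated with a simplified model of the term-frequency filter. This is only a sketch: it splits on whitespace and lowercases, and it ignores analyzers, stop words and the document-frequency checks that a real MLT implementation applies. The function name `candidate_terms` is made up for this example.

```python
from collections import Counter


def candidate_terms(text, min_term_freq):
    """Return the terms that occur at least min_term_freq times in text."""
    counts = Counter(text.lower().split())
    return sorted(term for term, n in counts.items() if n >= min_term_freq)


field = "Alarm Clock for GTK Environments"

# Every term occurs exactly once, so a threshold of 2 leaves nothing to match on.
print(candidate_terms(field, 2))  # []

# With a threshold of 1 all terms, including "alarm", become candidates.
print(candidate_terms(field, 1))  # ['alarm', 'clock', 'environments', 'for', 'gtk']
```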