I need to retrieve similar documents, but not ones that are too similar (our database contains many almost identical documents, which we want to skip in search).
I have been looking through the "More Like This" documentation and can't find a way to cap the similarity score. How can I achieve that, i.e. find similar documents but not near-identical ones? I would like a maximum similarity of 80%.
I'm not sure I fully understand your requirement, but the max_doc_freq parameter might help a little, although it is mainly intended to filter out stopwords.
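To make that concrete, here is a rough Python sketch rather than a definitive solution. It builds a More Like This request body with max_doc_freq set, and then drops near-duplicates on the client side. As far as I know, More Like This has no built-in "maximum similarity" option and _score is not a percentage, so the 80% cutoff below is a heuristic of my own: it compares each hit's score to a reference score that you would have to choose yourself (for example, the score of a known near-identical document). The function names, the index and field names, and the max_doc_freq value are all made up for illustration.

```python
def build_mlt_query(index, doc_id, fields, max_doc_freq):
    """Build a More Like This request body for an already indexed document.

    max_doc_freq makes the query ignore terms that occur in more than
    that many documents (mainly useful for dropping stopword-like terms).
    """
    return {
        "query": {
            "more_like_this": {
                "fields": fields,
                "like": [{"_index": index, "_id": doc_id}],
                "max_doc_freq": max_doc_freq,
            }
        }
    }


def drop_near_duplicates(hits, reference_score, max_ratio=0.8):
    """Keep only hits whose score is at most max_ratio of reference_score.

    Assumption: reference_score approximates the score of an almost
    identical document. This is a client-side heuristic, not a feature
    of the More Like This query itself.
    """
    if reference_score <= 0:
        raise ValueError("reference_score must be positive")
    return [h for h in hits if h["_score"] / reference_score <= max_ratio]


# Hypothetical index, document id, field names and threshold.
body = build_mlt_query("articles", "42", ["title", "body"], max_doc_freq=1000)

# Hits in the shape of response["hits"]["hits"], trimmed to the relevant keys.
sample_hits = [
    {"_id": "a", "_score": 10.0},
    {"_id": "b", "_score": 7.9},
    {"_id": "c", "_score": 5.0},
]
kept = drop_near_duplicates(sample_hits, reference_score=10.0)
```

You would send body as the search request and pass the returned hits through drop_near_duplicates. Whether a score ratio is a good stand-in for "80% similar" depends on your data, so treat the threshold as something to tune.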