Some of my searches return a total of over 10k documents, with scores ranging from high (~75 in my most recent search) to very low (less than 5). Other queries return a high score of ~20 and a low score of ~1.
Does anyone have a good solution for trimming off the less relevant documents? A Java or query implementation would work. I've thought about using min_score, but I'm wary of that since it has to be a constant number, and the scores in some of my responses are a lot closer together than in the examples above. I suppose I could come up with some formula based on the returned scores to create a cutoff for every response, but I was curious whether anyone has come up with a solution to a similar use case?
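To make the "formula based on the returned scores" idea concrete, here is a minimal client-side sketch in Java. It keeps only the hits whose score is at least some fraction of the top score in that response, so the cutoff scales with each query instead of being a constant like min_score. The class name, the method name and the ratio parameter are all made up for illustration, and the sketch assumes the scores arrive sorted in descending order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any Elasticsearch client library.
final class RelativeScoreCutoff {

    /**
     * Keeps the leading scores that are at least ratio * (top score).
     * Assumes the list is sorted in descending order of score.
     */
    static List<Double> trim(List<Double> scores, double ratio) {
        List<Double> kept = new ArrayList<>();
        if (scores.isEmpty()) {
            return kept;
        }
        // The cutoff is derived from this response's best score.
        double threshold = scores.get(0) * ratio;
        for (double score : scores) {
            if (score < threshold) {
                break; // sorted descending, so all later scores are lower too
            }
            kept.add(score);
        }
        return kept;
    }
}
```

With a ratio of 0.5, a response with scores 75, 60, 30, 4 would keep 75 and 60, and a response with scores 20, 12, 9, 1 would keep 20 and 12. The right ratio is a guess that would need tuning against real queries.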
In general it is recommended to not do anything like that and just return documents in descending order of score so that the most relevant ones appear first.
Instead of using a score cutoff, the general approach is usually to use a cutoff on the rank and a rescorer. For instance you could take the 10 best documents by relevance and reorder them based on some other criteria that denotes the authority or popularity of the document: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-rescore.html
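As a rough illustration of "cut off on rank, then reorder", here is a client-side sketch in Java. It is not the Elasticsearch rescore API (which does the reordering on the server, see the link above); it only shows the idea. The Hit class and its popularity field are hypothetical stand-ins for whatever authority or popularity signal your documents carry.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical client-side sketch of a rank cutoff followed by a reorder.
final class TopNRerank {

    static final class Hit {
        final String id;
        final double score;      // relevance score from the search
        final double popularity; // assumed secondary signal, e.g. view count

        Hit(String id, double score, double popularity) {
            this.id = id;
            this.score = score;
            this.popularity = popularity;
        }
    }

    /**
     * Takes the first windowSize hits (input assumed sorted by descending
     * score) and returns them reordered by descending popularity.
     */
    static List<Hit> rerank(List<Hit> hitsByScore, int windowSize) {
        int n = Math.max(0, Math.min(windowSize, hitsByScore.size()));
        List<Hit> window = new ArrayList<>(hitsByScore.subList(0, n));
        window.sort(Comparator.comparingDouble((Hit h) -> h.popularity).reversed());
        return window;
    }
}
```

Hits outside the window are simply dropped here, which is the rank cutoff; only the hits inside it compete on the secondary signal.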
If you're building a faceted search interface using aggregations, trimming the low-relevance tail is often useful, because it stops a long tail of weak matches from making nonsense of your facet summaries. Someone searching for a video by typing ice age shouldn't be told there are 300 matches in the electricals department just because you matched a lot of refrigerators with an ice dispenser.
One technique I've seen used on e-commerce sites is to start with a very tight interpretation of the user input, e.g. running the input ice age as a strict "ice age" phrase match. Only if the results are very few in number do they re-run a relaxed form of the search, i.e. ice OR age. Obviously, picking that magic threshold number can be tricky, and offering users ways to rewrite the query can help.
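The tight-then-relaxed flow can be sketched in Java roughly like this. The two search functions are placeholders for however you actually run the phrase query and the OR query against your cluster, and minResults is the "magic threshold" mentioned above; all of the names are assumptions made for illustration.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: try the strict query first, relax only if needed.
final class TightThenRelaxedSearch {

    /**
     * Runs phraseSearch on the input and returns its results if there are
     * at least minResults of them; otherwise falls back to anyTermSearch.
     */
    static List<String> search(String userInput,
                               Function<String, List<String>> phraseSearch,
                               Function<String, List<String>> anyTermSearch,
                               int minResults) {
        List<String> strict = phraseSearch.apply(userInput);
        if (strict.size() >= minResults) {
            return strict; // enough precise matches, no need to relax
        }
        // Too few strict matches: re-run with the looser interpretation.
        return anyTermSearch.apply(userInput);
    }
}
```

In this sketch the relaxed results replace the strict ones rather than being merged with them; whether to merge, and in what order, is a separate design decision.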