I have documents with a field like "category" : ["cat1", "cat2", "cat3"]. Some documents may have 20 or so different categories in the array; some may have only 1. I am trying to construct a query that does the following: given a filter such as ["cat1", "cat3", "cat5"], return first all docs that contain all 3 of those categories (in any order, and I'm not concerned if those docs also contain other categories), then the docs that contain only 2 of the 3, then the docs that contain any 1 of the 3, and finally all documents that contain none of the 3 categories in the request.
I'm getting close but my query still seems to show docs with no hits before docs with at least one hit. Any help would be greatly appreciated.
Ok, well that doesn't work because it starts excluding documents that don't have both matching. My real question is ultimately about sorting. I may have 25 documents with different categories and I still want to show all 25, however, I want those documents that contain both categories to score highest, then those docs with just 1 of the 2 categories score next highest, and then the rest of the docs really in any order.
For example, this query insists on ranking docs that don't contain any of the categories higher or with the same score as documents that just contain 1 of the categories.
You need to boost the relevance of the documents which match 2 fields so they score more than the docs which match just one.
In your results segment above, the docs matching just one 'allergen' score the same as docs matching 3 ("_score": 7.0610676). If you boost the 'match all 3' section by a factor of 100, the section that matches just 2 by a factor of 50, and the section that matches just 1 by a factor of 10, then the results should be properly weighted, with the more relevant documents appearing first.
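To make the arithmetic behind those boosts concrete, here is a small Python sketch. It does not talk to Elasticsearch; it assumes a simplified model in which a document's score is just the sum of the boosts of every tier it satisfies (real Lucene scoring applies further normalization, so actual numbers will differ). The names TIER_BOOSTS, tier_score and rank are made up for this illustration.

```python
# Boosts from the suggestion above: all three match, at least two, at least one.
# Simplified model (an assumption): score = sum of boosts of every tier satisfied.
TIER_BOOSTS = {3: 100.0, 2: 50.0, 1: 10.0}


def tier_score(doc_categories, wanted):
    """Sum the boost of every tier the document satisfies."""
    hits = len(set(doc_categories) & set(wanted))
    return sum(boost for needed, boost in TIER_BOOSTS.items() if hits >= needed)


def rank(docs, wanted):
    """Return doc ids sorted by descending score; ties keep their input order."""
    return sorted(docs, key=lambda doc_id: -tier_score(docs[doc_id], wanted))


wanted = ["cat1", "cat3", "cat5"]
docs = {
    "a": ["cat2"],                          # no match
    "b": ["cat1", "cat9"],                  # one match
    "c": ["cat5", "cat3", "cat1", "cat7"],  # all three, order irrelevant
    "d": ["cat3", "cat5"],                  # two matches
}

print(rank(docs, wanted))  # ['c', 'd', 'b', 'a']
```

A doc that matches all three also satisfies the lower tiers, so under this model the tier scores come out as 160, 60, 10 and 0, which keeps the groups strictly separated.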
Wrapping each compound query in a constant_score query will allow you to set the relevance score for all the documents matched by that query to whatever value you want.
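As a rough, untested sketch of what that could look like, the Python below only builds the request body as a dict; it does not send it anywhere. It assumes "category" is indexed so that term queries match the exact values, and it uses the constant_score/filter/boost layout of the query DSL, which may need adjusting for your Elasticsearch version. The helper names (term, at_least, build_query) are hypothetical.

```python
import json
from itertools import combinations


def term(field, value):
    """Exact-value clause; assumes the field is not analyzed into tokens."""
    return {"term": {field: value}}


def at_least(field, values, needed):
    """Filter for docs containing at least `needed` of `values`."""
    groups = [
        {"bool": {"must": [term(field, v) for v in combo]}}
        for combo in combinations(values, needed)
    ]
    return {"bool": {"should": groups}}


def build_query(field, values, boosts):
    """`boosts` maps a minimum number of matching values to a constant score."""
    tiers = [
        {"constant_score": {"filter": at_least(field, values, n), "boost": b}}
        for n, b in sorted(boosts.items(), reverse=True)
    ]
    # match_all in `must` keeps docs with no matching category in the results;
    # the `should` tiers only add to the score of docs that match them.
    return {"query": {"bool": {"must": [{"match_all": {}}], "should": tiers}}}


body = build_query("category", ["cat1", "cat3", "cat5"], {3: 100, 2: 50, 1: 10})
print(json.dumps(body, indent=2))
```

The match_all clause is what keeps all 25 documents in the result set, so the tiers affect only the ordering and never exclude anything.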
Again all theoretical and untested on my part.
Hope it helps.
Actually, what may not be apparent is that the document in the middle doesn't match any of the query terms, yet is scored equally with the one that matches one. This just seems wrong to me given the way the query is currently structured.