I have query_string queries made up of 4 parts like this:
(topic2_term1 OR topic2_term2 OR topic2_term3) AND
(topic3_term1 OR topic3_term2) AND
rare_term
I'm just querying a short title field and a longer content text field using the default BM25 similarity.
Typically topic1's terms are really popular in the corpus, topic2's less so, topic3's even less, and the final query term occurs infrequently in the corpus. The behavior I'm getting is that many of my highest scoring documents end up being about 2 or 3 of the topics, but barely mention the others.
I think what I'm trying to do here is prioritize documents that mention each of these subqueries equally. I don't want a document that is primarily about topic1 and just happens to mention the rare term once somewhere in the content.
Can someone suggest a way to go about this?
Right now I'm sending each subquery to a filters agg to get doc counts for each one.
    "aggregations": {
      "subs": {
        "buckets": {
          "topic1": {
            "doc_count": 13335846
          },
          "rare_term": {
            "doc_count": 225146
          },
          "topic2": {
            "doc_count": 1726988
          },
          "topic3": {
            "doc_count": 35396026
          }
        }
      }
    }
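For reference, here is a minimal sketch of how a request producing those buckets could be built. It only constructs the search body as a Python dict; the field names `title` and `content`, the topic1 terms, and the helper name `build_filters_agg` are assumptions for illustration, not my exact setup.

```python
import json

# Placeholder subqueries. The topic1 terms are hypothetical; the others
# reuse the placeholder names from the query shown above.
SUBQUERIES = {
    "topic1": "topic1_term1 OR topic1_term2",
    "topic2": "topic2_term1 OR topic2_term2 OR topic2_term3",
    "topic3": "topic3_term1 OR topic3_term2",
    "rare_term": "rare_term",
}

# Assumed field names for the short title and the longer content field.
FIELDS = ["title", "content"]


def build_filters_agg(subqueries, fields):
    """Return a search body with one named filter bucket per subquery."""
    return {
        "size": 0,  # only the bucket doc counts are needed, not the hits
        "aggs": {
            "subs": {
                "filters": {
                    "filters": {
                        name: {"query_string": {"query": q, "fields": fields}}
                        for name, q in subqueries.items()
                    }
                }
            }
        },
    }


body = build_filters_agg(SUBQUERIES, FIELDS)
print(json.dumps(body, indent=2))
```

The printed body can be sent to the `_search` endpoint of the index; each named filter comes back as one bucket with its `doc_count`.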
Then I'm using that to try two different approaches:
- Use the ratio between the most popular subquery's doc count and a given subquery's doc count as a boost. So if the most popular topic's terms occur in 157x more documents than the rare term, I use rare_term^157 in the original query_string.
- Rescore the original query 4 times, starting with the rare_term. The idea here would be to take the top 1000 based on BM25 and then find the top scoring docs among them for the rare_term, and so on.
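Roughly, the two approaches can be sketched like this. The doc counts are the ones from the aggregation above; the field names, the topic1 terms, the helper names, and the choice to round the ratio to an integer are assumptions for illustration. The rescore weights are left at their defaults.

```python
# Doc counts taken from the filters aggregation response above.
DOC_COUNTS = {
    "topic1": 13335846,
    "rare_term": 225146,
    "topic2": 1726988,
    "topic3": 35396026,
}

# Placeholder subqueries; the topic1 terms are hypothetical.
SUBQUERIES = {
    "topic1": "topic1_term1 OR topic1_term2",
    "topic2": "topic2_term1 OR topic2_term2 OR topic2_term3",
    "topic3": "topic3_term1 OR topic3_term2",
    "rare_term": "rare_term",
}

FIELDS = ["title", "content"]  # assumed field names


def ratio_boosts(doc_counts):
    """Boost each subquery by (largest doc count / its own doc count)."""
    top = max(doc_counts.values())
    return {name: round(top / count) for name, count in doc_counts.items()}


def boosted_query(subqueries, boosts):
    """Approach 1: AND the subqueries together, each group with its boost."""
    return " AND ".join(f"({q})^{boosts[name]}" for name, q in subqueries.items())


def rescore_body(subqueries, doc_counts, fields, window_size=1000):
    """Approach 2: run the full query, then rescore once per subquery,
    rarest subquery first, over the top `window_size` hits."""
    full_query = " AND ".join(f"({q})" for q in subqueries.values())
    rarest_first = sorted(subqueries, key=lambda name: doc_counts[name])
    return {
        "query": {"query_string": {"query": full_query, "fields": fields}},
        "rescore": [
            {
                "window_size": window_size,
                "query": {
                    "rescore_query": {
                        "query_string": {"query": subqueries[name], "fields": fields}
                    }
                },
            }
            for name in rarest_first
        ],
    }


boosts = ratio_boosts(DOC_COUNTS)
query = boosted_query(SUBQUERIES, boosts)
body = rescore_body(SUBQUERIES, DOC_COUNTS, FIELDS)
print(boosts)
print(query)
```

With these counts the rare term gets a boost of 157 and the most popular subquery gets 1. Note that with default rescore settings each pass adds the rescore query's score to the existing score, so later passes do not replace the ordering from earlier ones.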
Is there a third, better way?