Writing a custom_score or custom_filters_score query based on field value frequency

I would like to apologize ahead of time because this is a difficult situation to explain.

Say I have a field named "type" on every one of my documents. That type can be "Book", "DVD", or "Photo".

I then run a query across all items, say for "James Bond".

I will get back results that are mostly DVDs, while the Books, and maybe even the Photos, get pushed to the back because they score lower. Who knows why; maybe the rest of those documents' content doesn't mention James Bond as much as the DVDs do. The goal is to mix the top-scoring Books and Photos in with the top-scoring DVDs so that we end up with a diverse result set. So basically, lower the score of the more common type (DVD), and/or raise the score of the less common types (Book, Photo).

I want to write a query so that once I find a document with the type "DVD", the next document with type "DVD" gets a lower score, and the one after that gets an even lower score. The problem is that I don't want to do this at index time, because I want it to vary based on the results of the query (if the query comes back with 75 DVDs, the last DVD will have a much lower score than if the query comes back with 25 DVDs).
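
To make the idea above concrete, here is a minimal client-side sketch in Python, not an Elasticsearch query. It assumes the hits have already been fetched as (id, type, score) tuples; the `diversify` function and its `decay` parameter are hypothetical names made up for illustration. Each additional hit of a type has its score multiplied by another factor of `decay`, so the 75th DVD ends up far lower than the 25th.

```python
from collections import defaultdict

def diversify(hits, decay=0.8):
    """Re-rank hits so that each repeat of a type is penalized more.

    hits:  list of (doc_id, doc_type, score) tuples, in any order.
    decay: hypothetical tuning knob. The k-th hit of a given type
           (k = 0, 1, 2, ...) has its score multiplied by decay ** k.
    """
    seen = defaultdict(int)  # how many hits of each type we have passed so far
    rescored = []
    # Walk the hits from best to worst so the best hit of each type is untouched.
    for doc_id, doc_type, score in sorted(hits, key=lambda h: h[2], reverse=True):
        rescored.append((doc_id, doc_type, score * decay ** seen[doc_type]))
        seen[doc_type] += 1
    return sorted(rescored, key=lambda h: h[2], reverse=True)

# Made-up scores, purely for illustration.
hits = [
    ("d1", "DVD", 10.0),
    ("d2", "DVD", 9.0),
    ("d3", "DVD", 8.0),
    ("b1", "Book", 6.0),
    ("p1", "Photo", 5.0),
]

for doc_id, doc_type, score in diversify(hits, decay=0.5):
    print(doc_id, doc_type, score)
```

With these made-up scores, the best Book and Photo move ahead of the second and third DVDs, while the best DVD stays on top.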

I've thought of two possible ways of going about this, but I am not sure how to implement either of them.

One is to somehow use the custom_filters_score or custom_score query to get the current facet count on the "type" field, or to somehow store that running count per document found. I have no way of really explaining this better than that. But if I had that information somehow, I could write a custom_score query that inversely affects the document's score based on how many times we've seen the same "type".

The other option is to alter the score of the results based on the counts of a facet on the "type" field. Basically, inversely affecting the documents of each "type" based on how high its count is. For example, if DVDs have 75 results and Books have 2, multiply the score of all DVD items by 1/75, and multiply the score of all Books by 1/2. This is just a broad example; these numbers wouldn't work perfectly.
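
The arithmetic of this second option can be sketched in a few lines of Python. The `type_weights` function and its `alpha` exponent are hypothetical names, not part of any Elasticsearch API; `alpha=1.0` reproduces the 1/75 and 1/2 example above, and a smaller `alpha` is one possible way to soften the penalty, since the raw inverse is probably too harsh.

```python
def type_weights(facet_counts, alpha=1.0):
    """Map each type to a score multiplier that shrinks as its count grows.

    facet_counts: dict of type -> number of matching documents.
    alpha:        hypothetical damping exponent; 1.0 gives 1/count,
                  0.5 gives 1/sqrt(count), 0.0 leaves scores unchanged.
    """
    return {
        doc_type: 1.0 / (count ** alpha)
        for doc_type, count in facet_counts.items()
        if count > 0  # a type with no matches needs no weight
    }

counts = {"DVD": 75, "Book": 2}

harsh = type_weights(counts)             # DVD -> 1/75, Book -> 1/2
soft = type_weights(counts, alpha=0.5)   # DVD -> 1/sqrt(75), Book -> 1/sqrt(2)

print(harsh)
print(soft)
```

With `alpha=1.0` a Book is weighted 37.5 times higher than a DVD, which would likely bury even excellent DVDs; with `alpha=0.5` the ratio drops to roughly 6.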

I had already planned on doing the 2nd possible solution with two queries: one to get the facets for the query, and the second would use that information and send the altered boosts in custom_score queries. But I would ideally like to do this all in one query.
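
For reference, the two-query plan might look roughly like the following Python sketch, which only builds the request bodies and does not talk to a cluster. It assumes the 0.x-era query DSL (a terms facet and the custom_filters_score query with "filters", "boost" and "score_mode"); the function names, the facet name "types", the use of query_string, and the `alpha` exponent are illustrative assumptions, so check the bodies against the documentation for your version.

```python
import json

def facet_request(text):
    """First request: count the matches per type, returning no hits."""
    return {
        "size": 0,
        "query": {"query_string": {"query": text}},
        "facets": {"types": {"terms": {"field": "type"}}},
    }

def boosted_request(text, facet_terms, alpha=0.5):
    """Second request: same query, with one boost per type.

    facet_terms: list of {"term": ..., "count": ...} entries, assumed to be
                 taken from response["facets"]["types"]["terms"].
    alpha:       hypothetical damping exponent; boost = count ** -alpha.
    """
    filters = [
        {
            "filter": {"term": {"type": entry["term"]}},
            "boost": entry["count"] ** -alpha,
        }
        for entry in facet_terms
        if entry["count"] > 0
    ]
    return {
        "query": {
            "custom_filters_score": {
                "query": {"query_string": {"query": text}},
                "filters": filters,
                # Each document has exactly one type, so the first
                # matching filter is the only one that applies.
                "score_mode": "first",
            }
        }
    }

# Facet entries as they might come back from the first request (made up).
terms = [{"term": "dvd", "count": 100}, {"term": "book", "count": 4}]
print(json.dumps(boosted_request("James Bond", terms), indent=2))
```

Note that if "type" is an analyzed field, the facet terms come back in their analyzed form (for example lowercased), which is what the term filter needs to match.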

Any ideas? I know it is a lot to swallow, but it would give the search user a larger variety of options.

Hi Jim,
I came across your post while searching for any requirements for diversity
in results.

I'm working on an approach that allows you to limit results from any one
choice of field (in your case "type").
Using this approach, all of the results are still selected on their
individual merits (keeping their natural score) but there's an additional
rule applied that you can only have N of any one type. The thinking behind
this logic is described
here: https://issues.apache.org/jira/browse/LUCENE-6066?focusedCommentId=14219901&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14219901
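
To illustrate just the rule itself (this is not the implementation discussed in the linked issue, only a client-side Python sketch with hypothetical names), a cap of N per type over already-scored hits could look like this:

```python
from collections import Counter

def cap_per_type(hits, max_per_type=2, size=10):
    """Keep the best hits by natural score, at most max_per_type of any type.

    hits: list of (doc_id, doc_type, score) tuples, in any order.
    size: maximum number of results to return.
    """
    taken = Counter()  # how many hits of each type are already on the page
    page = []
    for hit in sorted(hits, key=lambda h: h[2], reverse=True):
        doc_type = hit[1]
        if taken[doc_type] >= max_per_type:
            continue  # this type has used up its quota; scores are untouched
        taken[doc_type] += 1
        page.append(hit)
        if len(page) == size:
            break
    return page

# Made-up scores, purely for illustration.
hits = [
    ("d1", "DVD", 10.0),
    ("d2", "DVD", 9.0),
    ("d3", "DVD", 8.0),
    ("b1", "Book", 6.0),
    ("p1", "Photo", 5.0),
]

for hit in cap_per_type(hits, max_per_type=2, size=4):
    print(hit)
```

Unlike a score adjustment, nothing is re-weighted here: the third DVD is simply skipped, and the Book and Photo fill the remaining slots.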

Would this approach work in your case?

On Saturday, March 16, 2013 6:20:52 AM UTC, Jim wrote:


--
View this message in context:
http://elasticsearch-users.115913.n3.nabble.com/Writing-a-custom-score-or-custom-filters-score-query-based-on-field-value-frequency-tp4031780.html
Sent from the ElasticSearch Users mailing list archive at Nabble.com.
