Decay score based on number of occurrences

tsturzl · December 9, 2014, 10:40pm

I'm trying to find a way to prevent multiple posts by the same author from
appearing in search results. So far I've tried random scoring, which lets me
keep pagination. However, I can still get up to 4 posts by the same author
in a given page of 10 results.

Is there any way to score a document based on how many times a certain
field value occurs in the result set? As far as I'm aware, you cannot
persist a variable or object in a scoring script.

I've looked into several ways of doing this, but most have significant
drawbacks. One is to remove the duplicates and query again for replacements
with the current authors excluded. However, that second query can itself
return several posts by the same author, so I'm left querying one at a time
to replace each duplicate. This breaks deep pagination, because the result
set used for replacements eventually runs out of pages before the standard
search does. I've also tried aggregations, which are not pageable.

Is there any functionality to spread out documents, or reduce a document's
score, based on how many times a document by the same author (or with the
same field value) occurs?
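
To make the idea concrete, here is a minimal client-side sketch of "decay by
occurrence". It is not an Elasticsearch feature: it assumes you over-fetch a
window of hits and re-rank them in the application. The function name, the
`author` field and the `decay` factor are all hypothetical.

```python
from collections import defaultdict


def decay_by_author(hits, decay=0.5, author_field="author"):
    """Re-rank hits so that each repeat of an author is penalised.

    The n-th hit (0-based) by one author has its score multiplied by
    decay ** n, so the best post per author keeps its original score.
    """
    seen = defaultdict(int)
    rescored = []
    # Visit hits best-first so the penalty falls on the weaker repeats.
    for hit in sorted(hits, key=lambda h: h["_score"], reverse=True):
        author = hit["_source"][author_field]
        new_score = hit["_score"] * (decay ** seen[author])
        seen[author] += 1
        rescored.append({**hit, "_score": new_score})
    # sorted() is stable, so ties keep their earlier relative order.
    return sorted(rescored, key=lambda h: h["_score"], reverse=True)


hits = [
    {"_id": "1", "_score": 10.0, "_source": {"author": "alice"}},
    {"_id": "2", "_score": 9.0, "_source": {"author": "alice"}},
    {"_id": "3", "_score": 8.0, "_source": {"author": "bob"}},
    {"_id": "4", "_score": 7.0, "_source": {"author": "alice"}},
]
reranked = decay_by_author(hits)
```

Because the penalty only sees the window that was fetched, this spreads
authors out within that window; it does not by itself give consistent deep
pagination.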

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/89f4676e-3472-4abf-a182-229299d2149f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

softwaredoug · December 10, 2014, 3:13am

Isn't the top_hits aggregation pageable? See the "from" parameter listed on
the page:

Certainly you don't want to page through everything (you want scan/scroll
for that), but it offers adequate paging for most search uses.

Or do you just want to eliminate duplicate authors within one page (i.e. a
set of 10) of results?

-Doug

--
Doug Turnbull
Search & Big Data Architect
OpenSource Connections http://o19s.com

tsturzl · December 10, 2014, 3:38am

Doug,

First of all, thanks for the reply. I was under the impression that
aggregations were not pageable, as everything I've read suggested as much.
I could be wrong. However, our marketing team would like our posts to
rotate, much like random scoring provides, so that each session a user sees
different posts.
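
For reference, a minimal sketch of per-session rotation with random scoring:
a function_score query with random_score and a seed derived from the session,
so ordering stays stable across pages within one session but differs between
sessions. The helper name and the use of a session id string are assumptions;
check the random_score documentation for your Elasticsearch version, since
the accepted options have changed over time.

```python
import zlib


def rotating_query(query, session_id):
    """Wrap a query so hits are ordered pseudo-randomly per session."""
    # crc32 gives a stable integer seed for a given session id string.
    seed = zlib.crc32(session_id.encode("utf-8"))
    return {
        "query": {
            "function_score": {
                "query": query,
                "random_score": {"seed": seed},
            }
        }
    }


first_page = rotating_query({"match_all": {}}, "session-1")
second_page = rotating_query({"match_all": {}}, "session-1")
other_user = rotating_query({"match_all": {}}, "session-2")
```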

The problem I noticed with pagination in aggregations is that "from" and
"size" apply to the hits within each bucket, while the number of buckets is
completely variable. So if I reduce it to one hit from every bucket, I have
about 90 authors, meaning 90 buckets with 1 result each. I can limit the
number of buckets, but I cannot set a "from" value on buckets; I can only
define the maximum number of buckets.
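
One possible workaround for the missing bucket offset, sketched under the
assumption that the author count stays small (about 90 here): request all
buckets with one hit each in a single query, then slice the bucket list on
the client. The aggregation names `by_author` and `top_posts` and the helper
itself are hypothetical; the response shape follows a terms aggregation with
a top_hits sub-aggregation.

```python
def page_of_authors(response, page, per_page=10,
                    agg="by_author", hits_agg="top_posts"):
    """Return one top hit per author for the given 0-based page."""
    buckets = response["aggregations"][agg]["buckets"]
    start = page * per_page
    page_buckets = buckets[start:start + per_page]
    # Take the first (best) hit of each bucket; skip empty buckets.
    return [
        b[hits_agg]["hits"]["hits"][0]
        for b in page_buckets
        if b[hits_agg]["hits"]["hits"]
    ]


# A fake response with 25 authors, one post each, to show the slicing.
fake_response = {
    "aggregations": {
        "by_author": {
            "buckets": [
                {
                    "key": f"author-{i}",
                    "doc_count": 1,
                    "top_posts": {"hits": {"hits": [
                        {"_id": str(i), "_source": {"author": f"author-{i}"}}
                    ]}},
                }
                for i in range(25)
            ]
        }
    }
}
last_page = page_of_authors(fake_response, page=2)
```

This keeps it to one query per search, at the cost of fetching every bucket
each time, and the bucket order still decides which authors land on which
page.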

I'm a little lost as to how to paginate aggregations for that reason. Also,
I'm only trying to make sure no author appears twice on a page, not in the
entire result set. Deep pagination doesn't have to work, but I'd also like
to avoid performing more than 1 query per search/page. The only solution
I've come up with is querying one at a time to replace the duplicates,
which can mean up to 11 calls. Also, some result sets are only 2-3 pages
long, so this may break pagination for small result sets.

I'm just having a very difficult time getting my head around this.
Elasticsearch itself doesn't seem to have any feature which can help me
produce this desired outcome.

Mark_Harwood_2 · December 10, 2014, 5:34pm

I have work underway in Lucene and Elasticsearch for a "diversified" form
of results collection: "Aggregations: new 'Sampler' provides a filter for
top-scoring docs" by markharwood, Pull Request #8191 in
elastic/elasticsearch on GitHub.
