Decay score based on number of occurrences

I'm trying to find a way to prevent multiple posts by the same author from
appearing in search results. So far I've tried random scoring, which
allows me to maintain pagination. However, I can still get up to 4 posts
by the same author in a given page of 10 results.
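For reference, a minimal sketch of the seeded random scoring described above, written as a Python dict for the request body. The field name `body`, the helper name `random_score_query`, and the seed value are assumptions for illustration; reusing one seed per session is what keeps the ordering stable from page to page. Newer Elasticsearch versions may also expect a `field` next to `seed`.

```python
import json


def random_score_query(user_query, session_seed, page, page_size=10):
    """Build a search body that shuffles matches with a per-session seed.

    The same seed yields the same ordering, so "from"/"size" paging
    stays consistent within one session.
    """
    return {
        "from": page * page_size,
        "size": page_size,
        "query": {
            "function_score": {
                "query": user_query,
                "random_score": {"seed": session_seed},
                # Ignore the text relevance score and use only the random one.
                "boost_mode": "replace",
            }
        },
    }


# Hypothetical query and seed, second page (page index 1) of 10 results.
body = random_score_query({"match": {"body": "search"}}, session_seed=42, page=1)
print(json.dumps(body, indent=2))
```

Note that nothing in this body looks at the author, which is why several posts by one author can still land on the same page.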

Is there any way to score a document based on how many times a certain
field occurs in the result set? As far as I'm aware you cannot persist a
variable or object in a scoring script.

I've looked into several methods of accomplishing this, but many of them
have quite a few cons. One is removing the duplicates and querying again
to retrieve a new set of results with the current authors excluded.
However, this can also return multiple posts by the same author. So I'm
left to query one by one to replace duplicate authors in a result set, and
this breaks deep pagination, because eventually the second result set,
which is used to replace duplicates, runs out of pages before the standard
search does. I've also tried aggregations, which are not page-able.

Is there any functionality to spread out documents, or to reduce the score
of a document, based on how many times a document by the same author (or
with the same field value) occurs?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/89f4676e-3472-4abf-a182-229299d2149f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Isn't the top_hits aggregation pageable? See the "from" parameter listed on
the page:

Certainly you don't want to page through everything (you want scan/scroll
for that), but it gives adequate paging for most search uses.
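As a rough illustration of the suggestion above, here is a sketch of a request body that groups posts by author with a terms aggregation and pages the hits inside each bucket with top_hits. The field name `author`, the aggregation names, and the helper name `top_hits_per_author` are assumptions, not taken from the thread.

```python
import json


def top_hits_per_author(author_field, hits_from, hits_size, max_authors):
    """Build a body with one bucket per author and paged hits per bucket."""
    return {
        "size": 0,  # skip the regular hit list, return only aggregations
        "aggs": {
            "by_author": {
                # "size" here caps the number of author buckets returned.
                "terms": {"field": author_field, "size": max_authors},
                "aggs": {
                    # "from"/"size" here page the hits inside each bucket.
                    "posts": {"top_hits": {"from": hits_from, "size": hits_size}}
                },
            }
        },
    }


agg_body = top_hits_per_author("author", hits_from=0, hits_size=1, max_authors=10)
print(json.dumps(agg_body, indent=2))
```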

Or do you just want to eliminate duplicate authors within one page (i.e.,
a set of 10) of results?

-Doug

On Tuesday, December 9, 2014, Travis sturzl travissturzl@gmail.com wrote:

I'm trying to find a way to prevent multiple posts from appearing in
search results that are from the same author. So far I've tried random
scoring, which allows me to maintain pagination. However, I can still have
up to 4 of the same authors in a given page of 10 results.

--
Doug Turnbull
Search & Big Data Architect
OpenSource Connections http://o19s.com

Doug,

First of all, thanks for the reply. I was under the impression that
aggregations were not page-able, as everything I've read suggested as
much. I could be wrong. However, our marketing team would like our posts
to rotate, much like random scoring provides, so that each session a user
will see different posts.

The problem I noticed with pagination in aggregations is that from and
size apply to the hits within each bucket, while the number of buckets is
completely variable. Paging therefore advances one hit in every bucket at
once. I have about 90 authors, meaning 90 buckets with 1 result each. I
can limit the number of buckets, but I cannot set a "from" value on
buckets; I can only define the maximum number of buckets.
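The behaviour described above can be imitated in plain Python without a cluster. This is a simplified simulation, not Elasticsearch itself: the sample data and the helper name `bucket_page` are made up, and real terms buckets are ordered by document count rather than by first appearance.

```python
from collections import defaultdict


def bucket_page(posts, hits_from, hits_size, max_buckets):
    """Imitate per-bucket paging: the offset applies inside each bucket."""
    buckets = defaultdict(list)
    for author, title in posts:
        buckets[author].append(title)
    page = {}
    # Only a cap on the bucket count is available, no bucket offset.
    for author in list(buckets)[:max_buckets]:
        page[author] = buckets[author][hits_from:hits_from + hits_size]
    return page


posts = [("ann", "a1"), ("ann", "a2"), ("bob", "b1"), ("cat", "c1"), ("cat", "c2")]
first = bucket_page(posts, hits_from=0, hits_size=1, max_buckets=10)
second = bucket_page(posts, hits_from=1, hits_size=1, max_buckets=10)
print(first)   # one post for every author
print(second)  # the next post per author; authors with one post come back empty
```

With 90 authors, raising "from" moves every bucket forward by one hit at the same time, so the page size follows the number of buckets rather than a fixed 10.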

I'm a little lost as to how to paginate aggregations for that reason. Also,
I'm only trying to make sure no author appears more than once per page,
not in the entire result set. Deep pagination doesn't have to work, but I'd
also like to avoid performing more than 1 query per search/page. The only
solution I've come up with is querying one by one to replace the
duplicates, which can mean up to 11 calls. In addition, some result sets
are only 2-3 pages long, so this may break pagination for small result
sets.
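One way to stay at a single query per page, sketched here purely as an assumption and not something proposed in the thread, is to request more hits than the page needs and then keep only the first post per author on the client. The `author` key and the helper name `one_per_author` are hypothetical.

```python
def one_per_author(hits, page_size=10):
    """Keep the first hit per author, in ranked order, up to page_size hits."""
    seen = set()
    page = []
    for hit in hits:
        author = hit["author"]
        if author in seen:
            continue  # a post by this author is already on the page
        seen.add(author)
        page.append(hit)
        if len(page) == page_size:
            break
    return page


# Hypothetical over-fetched hits, already sorted by score.
hits = [
    {"id": 1, "author": "ann"},
    {"id": 2, "author": "ann"},
    {"id": 3, "author": "bob"},
    {"id": 4, "author": "ann"},
    {"id": 5, "author": "cat"},
]
page = one_per_author(hits, page_size=3)
print([h["id"] for h in page])
```

The trade-off is that skipped posts are dropped unless the caller tracks them, so page boundaries no longer line up with plain from/size offsets, and a short result set can still yield a partial page.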

I'm just having a very difficult time getting my head around this.
Elasticsearch itself doesn't seem to have any feature which can help me
produce this desired outcome.

I have work underway in Lucene and Elasticsearch for a "diversified" form
of results collection: "Aggregations: new 'Sampler' provides a filter for
top-scoring docs" (Pull Request #8191 by markharwood, elastic/elasticsearch
on GitHub).
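For context, later Elasticsearch releases document a diversified_sampler aggregation that limits how many top-scoring documents share one field value. Whether its parameters match what the pull request proposed at the time is an assumption; the field name `author`, the aggregation names, and the helper name `diversified_body` are illustrative only.

```python
import json


def diversified_body(user_query, author_field, page_size=10, per_author=1):
    """Build a body that samples top-scoring docs with a per-author cap."""
    return {
        "size": 0,
        "query": user_query,
        "aggs": {
            "diverse": {
                "diversified_sampler": {
                    # Number of top-scoring docs sampled on each shard.
                    "shard_size": page_size,
                    "field": author_field,
                    # Cap on sampled docs sharing one value of the field.
                    "max_docs_per_value": per_author,
                },
                "aggs": {"posts": {"top_hits": {"size": page_size}}},
            }
        },
    }


sampler_body = diversified_body({"match_all": {}}, "author")
print(json.dumps(sampler_body, indent=2))
```

The cap is applied per shard, so on a multi-shard index the combined result may still need a client-side check, and this body offers no offset for deeper pages.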

On Wednesday, December 10, 2014 3:38:47 AM UTC, Travis sturzl wrote:

Doug,

First of all thanks for the reply. I was under the impression that
Aggregations were not page-able, as everything I've read suggested
otherwise. I could be wrong, however our marketing team would like our
posts to rotate, much like random scoring provides, this way each session a
user will see different posts.
