Dynamic fields on scoring

Hi there. We are trying to use some very dynamic fields as part of a
scoring script. We want to combine things like downloads
(daily/weekly/monthly) and rating into a "hotness" factor.

But here's our big dilemma:

If we were using just the total downloads, it would be no problem at all,
but since we also need partials per week and month, we end up updating a
lot of documents. For example, a given song might not have been downloaded
in 9 days, yet its monthly count still decreases each day, since it covers
only the past 30 days. We are trying to avoid pushing so many docs to the
index, and here's why: after trying this on a live cluster, we pushed
around 20% of the whole index size (that was our whole delta). Now please
correct me if I'm wrong: Lucene won't update docs in place, it marks the
old one as deleted and creates a new one, right? Thus growing the segment
files. We then ran an optimize, and it brought the cluster to its knees :(

So we are experimenting with different things. One of them is to use some
sort of script to calculate the weekly and monthly downloads. I would like
to share my thoughts here, and as stupid as they might look, I would love
to hear feedback or any other approaches the community is using.

So here's what I came up with:

```
{
  downloads: Integer,
  lastUpdated: Date,
  monthly: Integer[]
}
```

Here's a simple example. Say we have a record like this, and assume today
is 01/01/2013:

```
{
  downloads: 1000,
  lastUpdated: 01/01/2013,
  monthly: [10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1]
}
```

Here's the algorithm to calculate the monthly downloads:

1. Calculate the difference between today and lastUpdated in days.
2. Shift the array right by that many positions, padding with zeros at the
   front and dropping entries that fall off the end.
3. monthly := sum of the array
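
A minimal sketch of that shift-and-sum in Python (the function names and
the `record_download` write side are mine, purely for illustration; in
Elasticsearch this logic would live in a scoring script):

```python
from datetime import date

WINDOW = 30  # days covered by the circular buffer

def shift_window(monthly, last_updated, today):
    """Shift the buffer right by the days elapsed since last_updated,
    padding zeros at the front and dropping days older than WINDOW."""
    elapsed = (today - last_updated).days
    if elapsed <= 0:
        return list(monthly)
    if elapsed >= WINDOW:
        return [0] * WINDOW
    return [0] * elapsed + list(monthly[:WINDOW - elapsed])

def monthly_downloads(monthly, last_updated, today):
    """Downloads over the past WINDOW days, as of today."""
    return sum(shift_window(monthly, last_updated, today))

def record_download(doc, today):
    """Write side: realign the buffer, then count today's download in slot 0."""
    doc["monthly"] = shift_window(doc["monthly"], doc["lastUpdated"], today)
    doc["monthly"][0] += 1
    doc["downloads"] += 1
    doc["lastUpdated"] = today
```

With the record above and today = 01/05/2013, four days have elapsed, the
trailing 1 falls out of the window, and `monthly_downloads` returns 10.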

So, let's imagine that now = 01/05/2013.

We haven't updated this record, since nothing changed in the downloads. To
get the monthly number we would "transform" the array on the fly: 4 days
have elapsed, so we shift it right by 4:

```
[0,0,0,0,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
```

The trailing 1 falls out of the 30-day window, so monthly downloads are
now 10 instead of 11.

Basically, the idea is to use a circular buffer to hold the download information.

The point is just to avoid updating too many records (and only because we
had problems with the segments). What we see as a problem, though, is that
this script is more costly than having a static number updated every day.

Any thoughts on that would be much appreciated.

Regards


Hiya

On Wed, 2013-04-10 at 07:49 -0700, Vinicius Carvalho wrote:

> Hi there. We are trying to use some very dynamic fields as part of a
> scoring script. We want to combine things like downloads
> (daily/weekly/monthly) and rating into a "hotness" factor.

Indeed, a thorny problem...

> If we were using just the total downloads, it would be no problem at
> all, but since we also need partials per week and month, we end up
> updating a lot of documents. For example, a given song might not have
> been downloaded in 9 days, yet its monthly count still decreases each
> day, since it covers only the past 30 days. We are trying to avoid
> pushing so many docs to the index, and here's why: after trying this on
> a live cluster, we pushed around 20% of the whole index size (that was
> our whole delta). Now please correct me if I'm wrong: Lucene won't
> update docs in place, it marks the old one as deleted and creates a new
> one, right? Thus growing the segment files. We then ran an optimize,
> and it brought the cluster to its knees :(

Don't run optimize. It is a very heavy action, as you have already
experienced :) Optimize is really useful for indices which are no
longer going to change, like log data for the previous week.

You can let the merge process take care of rewriting segments as
required. You may also want to add some merge throttling, to stop the IO
from merging overwhelming the cluster.
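
For example, a sketch of setting the 0.90-era store-level throttle
dynamically (the localhost host and the 20mb limit are placeholders;
check the docs for your version):

```python
import json
import urllib.request

# indices.store.throttle.* are the 0.90-era store-level throttling
# settings; they can be updated dynamically via the cluster settings API.
body = json.dumps({
    "transient": {
        "indices.store.throttle.type": "merge",
        "indices.store.throttle.max_bytes_per_sec": "20mb",
    }
}).encode()

req = urllib.request.Request(
    "http://localhost:9200/_cluster/settings",  # placeholder host
    data=body,
    method="PUT",
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())
```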

> So we are experimenting with different things. One of them is to use
> some sort of script to calculate the weekly and monthly downloads. I
> would like to share my thoughts here, and as stupid as they might look,
> I would love to hear feedback or any other approaches the community is
> using.

Your thoughts seem perfectly reasonable, and so do your concerns about
the impact of using a script.

I'm unsure which is the better option:

  1. updating all download docs daily
  2. using the script to calculate on the fly.

Certainly option (1) would be heavier at indexing time, but would perform
better at search time. Even though you end up reindexing a lot of docs,
they are small, so it may be a feasible solution, especially if you
reindex them using the bulk API instead of one by one.
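
A minimal sketch of that kind of daily bulk reindex (the `downloads`
index, `song` type, and field names are placeholders; the action-line /
source-line format is the standard _bulk one):

```python
import json
import urllib.request

def bulk_reindex(docs, host="http://localhost:9200"):
    """Push a whole batch of download docs in one _bulk request
    instead of one index request per document."""
    lines = []
    for doc in docs:
        # One action line, then one source line, newline-delimited.
        lines.append(json.dumps(
            {"index": {"_index": "downloads", "_type": "song", "_id": doc["id"]}}))
        lines.append(json.dumps(
            {"downloads": doc["downloads"],
             "monthly": doc["monthly"],
             "lastUpdated": doc["lastUpdated"]}))
    body = ("\n".join(lines) + "\n").encode()
    req = urllib.request.Request(host + "/_bulk", data=body,
                                 headers={"Content-Type": "application/x-ndjson"})
    return json.loads(urllib.request.urlopen(req).read())
```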

The performance impact of option (2) might be mitigated by using the new
'rescorer' functionality available in the 0.90 branch. This would allow
you to run your basic query on the full index, then rescore e.g. the top
500 results with the "hotness" factor using a script.
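
A sketch of what such a request body could look like (the rescore /
window_size / rescore_query keys are from the 0.90 rescore API; the base
query and the hotness script are made up for illustration):

```python
import json

# Cheap base query over the whole index; only the top 500 hits get
# rescored with the (hypothetical) hotness script.
rescore_body = {
    "query": {"match": {"genre": "rock"}},  # placeholder base query
    "rescore": {
        "window_size": 500,
        "query": {
            "rescore_query": {
                "custom_score": {
                    "query": {"match_all": {}},
                    # Placeholder hotness script (MVEL was the 0.90 default).
                    "script": "doc['downloads'].value * _score",
                }
            },
            "query_weight": 1.0,
            "rescore_query_weight": 2.0,
        },
    },
}
print(json.dumps(rescore_body, indent=2))
```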

However, if you don't have a "basic query" and you're calculating the
score based purely on hotness, then you'll end up running that query on
all docs in the index, so that won't help much. That said, the results
from this query will only change once per day, so you could just cache
them.
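
If you go the caching route, it can be as simple as keying the ranking by
the current date on the application side (a hypothetical sketch;
`run_query` stands in for whatever executes the hotness search):

```python
from datetime import date

_hotness_cache = {}

def hottest_songs(run_query):
    """Return today's cached ranking, recomputing at most once per day."""
    today = date.today()
    if today not in _hotness_cache:
        _hotness_cache.clear()  # drop yesterday's entry
        _hotness_cache[today] = run_query()
    return _hotness_cache[today]
```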

So short answer: test both methods out and see which works best.

And please report back - I'd be interested to hear what your tests show.

clint
