Hi there. We are trying to use some very dynamic fields in a scoring
script. We want to combine things like downloads (daily/weekly/monthly) and
rating to come up with a hotness factor.
But here's our big dilemma:
If we were using just the total downloads, it would be no problem at all,
but since we also need partial counts per week and month, we end up
updating a lot of documents. For example, a given song may not have been
downloaded in 9 days, yet its monthly count can still decrease each day,
because it only covers the past 30 days. We are trying to avoid pushing so
many docs to the index, and here's the reason: when we tried this on a live
cluster, we pushed around 20% of the whole index size (that was our whole
delta). Please correct me if I'm wrong, but Lucene won't update the docs in
place; it will mark the old one as deleted and create a new one, right?
That increases the size of the segment files. We then ran an optimize, and
this brought the cluster to its knees.
So we are experimenting with different things. One of them is to use some
sort of script to calculate the weekly and monthly downloads. I would like
to share my thoughts here and, as stupid as they might look, I would love
to hear feedback or any other approaches the community is using.
So here's what I came up with:
{
downloads: Integer,
lastUpdated: Date,
monthly: Integer[]
}
Let me give a simple example. Say we have a record like this, and assume
today is 01/01/2013:
{
downloads: 1000,
lastUpdated: 01/01/2013,
monthly: [10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1]
}
Here's the algorithm to calculate the monthly downloads:
1. calculate the difference between today and lastUpdated in days
2. shift the array by that number of days, padding with one zero per day
   (values shifted past the end of the array are dropped)
3. monthly := sum of array
So, let's imagine that now = 01/05/2013.
We haven't updated this record, since nothing changed in its downloads. To
get the number of downloads we would "transform" the array into something
like this:
[0,0,0,0,0,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
So monthly downloads are now 10 instead of 11, because the 1 in the last
slot has aged out of the window.
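To make the steps above concrete, here is a minimal sketch in plain Python.
It is only an illustration of the arithmetic, not an Elasticsearch script.
It assumes that slot 0 of the array holds the most recent day, that the
window is 30 days long, and that the dates in the example are MM/DD/YYYY.
The names shift_window and monthly_downloads are made up for this sketch.

```python
from datetime import date


def shift_window(daily, days_elapsed):
    """Return a copy of `daily` aged by `days_elapsed` days.

    Assumption: slot 0 is the most recent day, so aging the window
    prepends zeros and drops the oldest slots off the end.
    """
    if days_elapsed < 0:
        raise ValueError("lastUpdated is in the future")
    shift = min(days_elapsed, len(daily))
    return [0] * shift + daily[:len(daily) - shift]


def monthly_downloads(daily, last_updated, today):
    """Sum the window after aging it from `last_updated` to `today`."""
    days_elapsed = (today - last_updated).days
    return sum(shift_window(daily, days_elapsed))


# The record from the example: 10 downloads in the newest slot,
# 1 download in the oldest slot, 30 slots in total.
record = {
    "downloads": 1000,
    "lastUpdated": date(2013, 1, 1),
    "monthly": [10] + [0] * 28 + [1],
}

# On the day of the last update nothing has aged out yet.
print(monthly_downloads(record["monthly"], record["lastUpdated"], date(2013, 1, 1)))  # 11
# Four days later the oldest slot has dropped out of the window.
print(monthly_downloads(record["monthly"], record["lastUpdated"], date(2013, 1, 5)))  # 10
```

The stored document is never modified here; the shift happens on a copy at
read time, which is the point of the approach.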
Basically, the idea is to use a circular buffer to hold the download
information. The goal is just to avoid updating too many records (and only
because we had problems with the segments). What we see as a problem,
though, is that this script is more costly than having a static number
updated every day.
Any thoughts on that would be much appreciated.
Regards