Recency: Boost by Date

Shay,

Feel free to ping me if you need help integrating G++

Alex

Heya,

Worked with Mike from mvel a bit on this, and he improved mvel perf by 10x for this type of script. It's going to be included in the upcoming 0.15 release.

-shay.banon

On Wednesday, January 19, 2011 at 5:08 PM, Karussell wrote:

I played with the master where you fixed the groovy thing. Thanks for
that, btw :) !

Now (only) 353k tweets:

1. mvel with string insertion of mynow/time => ~5s query time

From here on, all experiments pass mynow as a parameter (instead of
string insertion). Query times in seconds:

2. mvel => 5.2, 4.9, 5.1, 4.8
3. js => 1.0, 0.5, 0.6, 0.5
4. groovy => 1.1, 1.0, 0.8, 0.8
5. python => 1.5, 0.9, 0.9, 0.9

So js really seems to be the fastest in my case :)

mvel seems to get slower the more 'complex' my equation is, unlike js,
which nearly always takes around 0.5s.
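
A minimal sketch of the two variants compared above: inserting the timestamp into the script text versus passing it as a parameter. The `custom_score` body layout, the `lang` key and the scoring formula are assumptions made for illustration; only `mynow`, `_score` and `doc['dt'].value` are names taken from this thread.

```python
import json

# Hypothetical recency formula; the actual script is not shown in this part of the thread.
PARAM_SCRIPT = "_score / (1.0 + (mynow - doc['dt'].value) / 3600000.0)"


def inline_body(now_ms: int, lang: str = "js") -> dict:
    """String insertion: the timestamp becomes part of the script source."""
    script = PARAM_SCRIPT.replace("mynow", str(now_ms))
    return {"query": {"custom_score": {
        "query": {"match_all": {}},
        "script": script,
        "lang": lang,
    }}}


def param_body(now_ms: int, lang: str = "js") -> dict:
    """Parameter passing: the script source is identical for every request."""
    return {"query": {"custom_score": {
        "query": {"match_all": {}},
        "script": PARAM_SCRIPT,
        "params": {"mynow": now_ms},
        "lang": lang,
    }}}


def to_json(body: dict) -> str:
    """Serialize a request body the way it would be sent with curl -d."""
    return json.dumps(body, indent=2)
```

With `param_body`, requests issued at different times share the same script text; with `inline_body`, every new timestamp yields a new script string.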

"You can never shave enough milliseconds" starts to
dominate my life too much... :)

I wonder if there is an alternative approach to improve speed.
Couldn't the document IDs be cached in some way?

Or even detect which variables are used (in my case _score and
doc['dt'].value) and cache the results of the equation?

Another idea would be to calculate the result or the _score less
precisely (e.g. cut off after 3 decimal places) and then use caching ...

Regards,
Peter.

On 18 Jan., 00:31, Shay Banon shay.ba...@elasticsearch.com wrote:

I still wonder if it's worth it to have an optimized script engine for numeric based calcs. I will hack with it a bit and see if it makes sense. "You can never shave enough milliseconds" starts to dominate my life too much... :)

On Tuesday, January 18, 2011 at 1:29 AM, Shay Banon wrote:

Ha :). Well, those long nights doing deep integration with Rhino were worth it :). I think Groovy might actually be faster (I fixed the problem you mention in master). All three languages, javascript (Rhino), python (jython), and groovy, have very low-level integration that makes them really fast to execute (sadly, I haven't found the same optimizations possible with jruby).

Though, it's strange that mvel is taking this long compared to rhino. Sure, it might not get compiled "as much" into bytecode, but still. I will look into it.

On Tuesday, January 18, 2011 at 12:44 AM, Karussell wrote:

Using javascript the query now executes in under 0.8sec!!
(when was the last time we used js to improve performance ;) ? ok, it's
only an implementation ... but: nice!)

BTW1: using groovy I got:

Parse Failure [Failed to parse source [na]]]; nested:
ElasticSearchIllegalArgumentException[script_lang not supported
[groovy]];

I added the groovy plugin jar the same way I added the js plugin jar
(via maven).

BTW2: I had to restart the node to change the language engine. Would
you mind adding this to the docs? Otherwise one thinks one is using
the new language for the script while still using the first one!

Hi all,

Does anyone have a working example of what you did in the end to boost by recency? It's something I'm interested in doing as well.

Cheers,

Ross

Hi Ross

> Does anyone have a working example of what you did in the end to boost by
> recency? It's something I'm interested in doing as well.

The easiest (and probably fastest) way to include recency is to use a
range query: This query boosts docs from 2010 by 1, and docs from 2011
by 2:

curl -XGET 'http://127.0.0.1:9200/iannounce_object/_search?pretty=1' -d '
{
   "query" : {
      "bool" : {
         "must" : {
            "text" : { "keywords" : "words to find" }
         },
         "should" : [
            {
               "range" : {
                  "publish_date" : { "gte" : "2011-01-01", "boost" : 2 }
               }
            },
            {
               "range" : {
                  "publish_date" : {
                     "lt" : "2011-01-01",
                     "gte" : "2010-01-01",
                     "boost" : 1
                  }
               }
            },
            {
               "range" : {
                  "publish_date" : { "lt" : "2010-01-01" }
               }
            }
         ]
      }
   }
}
'
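
The `should` clauses of a query like the one above could be generated instead of hand-written when more yearly buckets are wanted. This is a sketch only: the helper name and its parameters are hypothetical, and the rule that the boost drops by 1 per year is an assumption extrapolated from the two boosted years in the example.

```python
def yearly_boost_clauses(field: str, newest_year: int, boosted_years: int) -> list:
    """Build range clauses: the newest year gets the highest boost,
    each earlier year one less, and anything older gets no explicit boost."""
    if boosted_years < 1:
        raise ValueError("boosted_years must be at least 1")

    # Open-ended bucket for the newest year.
    clauses = [{"range": {field: {
        "gte": f"{newest_year}-01-01",
        "boost": boosted_years,
    }}}]

    # One closed bucket per earlier boosted year.
    for i in range(1, boosted_years):
        year = newest_year - i
        clauses.append({"range": {field: {
            "lt": f"{year + 1}-01-01",
            "gte": f"{year}-01-01",
            "boost": boosted_years - i,
        }}})

    # Catch-all bucket for everything older than the boosted years.
    oldest_boosted = newest_year - boosted_years + 1
    clauses.append({"range": {field: {"lt": f"{oldest_boosted}-01-01"}}})
    return clauses
```

Calling `yearly_boost_clauses("publish_date", 2011, 2)` yields the three clauses used in the query above.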

clint

I think the fastest way (and the least flexible one) is putting the
time into a separate field while indexing ;) and then boosting or
sorting against that field.
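
As a rough sketch of the sorting variant, a search body could order hits by the date field first and by relevance second. The field name `publish_date` and the `text` query are borrowed from Clint's example; the function name is hypothetical, and whether sorting alone gives useful results depends on the data.

```python
def search_sorted_by_date(keywords: str, date_field: str = "publish_date") -> dict:
    """Build a search body that sorts newest-first, then by score."""
    return {
        "query": {"text": {"keywords": keywords}},
        "sort": [
            {date_field: {"order": "desc"}},  # newest documents first
            "_score",                         # ties broken by relevance
        ],
    }
```

Sorting needs no script at query time, but unlike a boost it lets the date override relevance entirely.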

On 13 Jul., 10:34, Clinton Gormley clin...@iannounce.co.uk wrote:

> The easiest (and probably fastest) way to include recency is to use a
> range query [...]

@Ross,

I use a script just like the one described at the top of this thread
(but with parameters to avoid changing the cached string each time),
e.g.:

// "now" and "tdecay" are passed as parameters so the script string stays constant
Map<String, Object> params = new HashMap<String, Object>();
params.put("now", nDecayTime);
params.put("tdecay", dInvDecay);
return QueryBuilders.customScoreQuery(currQuery)
    .script("_score/(1.0 + tdecay*abs(now - doc['publishedDate'].value))")
    .params(params);

(I think it's obvious how to turn the above into JSON, though I'm too
lazy to do it here)
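
For reference, a sketch of what the JSON form might look like, built in Python so it can be serialized directly. It assumes the `custom_score` query takes `query`, `script` and `params` keys; the `match_all` inner query in the usage note stands in for `currQuery`, and the function name is hypothetical.

```python
import json

# Script text copied from the Java snippet above.
DECAY_SCORE_SCRIPT = "_score/(1.0 + tdecay*abs(now - doc['publishedDate'].value))"


def custom_score_body(inner_query: dict, now: int, tdecay: float) -> dict:
    """Wrap `inner_query` in a custom_score query with the decay script."""
    return {"query": {"custom_score": {
        "query": inner_query,
        "script": DECAY_SCORE_SCRIPT,
        "params": {"now": now, "tdecay": tdecay},
    }}}


def to_json(body: dict) -> str:
    """Serialize the body for use as a request payload."""
    return json.dumps(body, indent=2)
```

For example, `to_json(custom_score_body({"match_all": {}}, now, tdecay))` gives a payload that could be posted to `_search`, with `now` and `tdecay` supplied by the caller.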

I also calculate and store the decay ("1.0/(1.0 + tdecay*abs(now -
doc['publishedDate'].value))") as a script field since it's needed in
the post processing, so I effectively do the above calculation twice
per record (you don't appear to be able to store a script field and
then use it in a "subsequent" script).
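
A sketch of how the decay could be requested as a script field alongside the scoring script, assuming `script_fields` entries accept `script` and `params`. The field name `decay` and the helper are hypothetical.

```python
# Decay expression as quoted above (the scoring script without the _score factor).
DECAY_FIELD_SCRIPT = "1.0/(1.0 + tdecay*abs(now - doc['publishedDate'].value))"


def with_decay_field(search_body: dict, now: int, tdecay: float,
                     name: str = "decay") -> dict:
    """Return a copy of `search_body` with the decay added as a script field."""
    body = dict(search_body)  # shallow copy, the input is left untouched
    fields = dict(body.get("script_fields", {}))
    fields[name] = {
        "script": DECAY_FIELD_SCRIPT,
        "params": {"now": now, "tdecay": tdecay},
    }
    body["script_fields"] = fields
    return body
```

The same `now` and `tdecay` values have to be passed to both scripts, since the field script cannot read the result of the scoring script.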

When adding this to a very simple test query matching everything (via
a term query against a specific field array that happens to return
true), the query time jumps from ~800ms to ~1300ms against ~1M
documents. Since the script runs twice per record, I assume it
therefore takes ~250ms per script per 1M documents.
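
The estimate can be written out as a small calculation. It assumes the extra time splits evenly between the two script evaluations per record (scoring and script field) and scales linearly with the number of documents; the function names are made up for the sketch.

```python
def per_script_cost_ms(baseline_ms: float, with_scripts_ms: float,
                       script_evaluations_per_doc: int) -> float:
    """Extra query time attributed to one script evaluation over all docs."""
    return (with_scripts_ms - baseline_ms) / script_evaluations_per_doc


def per_doc_cost_ns(cost_ms: float, docs: int) -> float:
    """Convert a total cost in milliseconds to nanoseconds per document."""
    return cost_ms * 1_000_000 / docs


# (1300 ms - 800 ms) / 2 evaluations = 250 ms per script per ~1M documents
cost = per_script_cost_ms(800, 1300, 2)
# 250 ms spread over 1,000,000 documents = 250 ns per document
per_doc = per_doc_cost_ns(cost, 1_000_000)
```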

So these results seem consistent with the discussion above (using the
latest mvel implementation mentioned by Shay).

I normally get in the region of 100K results back from simple queries,
and more complex queries take longer anyway, so this has been low
priority for me to investigate further. Of course if I'm doing
something stupid then I'd love to hear it :) otherwise, I hope this
helped!

On Jul 13, 7:09 am, Karussell tableyourt...@googlemail.com wrote:

> I think the fastest way (and the least flexible one) is putting the
> time into a separate field while indexing ;) and then boosting or
> sorting against that field.
