Scala scripting - Native performance with script flexibility?


(Felipe Hummel) #1

Hey guys, I decided to take a couple of days to try out an idea.
Scala can compile Scala code at runtime from inside your own code and then
use the result. More specifically, the twitter-util library
(https://github.com/twitter/util) has an Eval module that makes compiling
code very easy. I've heard that you can do similar things in Java.
So what if we could use Scala as just another scripting language in ES? We
would send Scala code through the 'script' field in customScore queries; an
ES plugin compiles the code to a regular object and uses it in the scoring.
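
A minimal sketch of the compile-then-reuse flow, assuming compiled scripts are cached by their source text (which is what the later note about repeated queries suggests). `CompiledScript`, `ScriptCache` and `fakeCompile` are hypothetical names for illustration, not the plugin's actual classes; `fakeCompile` stands in for the expensive runtime compilation step.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for a compiled script: any object that can score a value.
trait CompiledScript {
  def run(fieldValue: Double): Double
}

// Caches compiled scripts by source text so each distinct script is compiled once.
// `compile` is a placeholder for the expensive step (a call into a runtime compiler).
final class ScriptCache(compile: String => CompiledScript) {
  private val cache = new ConcurrentHashMap[String, CompiledScript]()

  def get(source: String): CompiledScript =
    cache.computeIfAbsent(source, new java.util.function.Function[String, CompiledScript] {
      def apply(src: String): CompiledScript = compile(src)
    })
}

object ScriptCacheDemo {
  private val compilations = new AtomicInteger(0)

  // Fake "compiler": counts invocations and returns a fixed scoring function.
  def fakeCompile(source: String): CompiledScript = {
    compilations.incrementAndGet()
    new CompiledScript { def run(fieldValue: Double): Double = math.log(fieldValue) }
  }

  def main(args: Array[String]): Unit = {
    val cache = new ScriptCache(fakeCompile)
    val first = cache.get("log(doubleField(\"a\"))")
    val second = cache.get("log(doubleField(\"a\"))")
    println(s"same instance: ${first eq second}, compilations: ${compilations.get}")
  }
}
```

With a cache like this, only the first request for a given script source pays the compilation cost; later requests get the already compiled object.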

I implemented the idea, maybe with some issues, but it serves as a first
proof of concept. The code is here:
https://github.com/felipehummel/elasticsearch-lang-scala/
The relevant part is just this file: ScalaScriptEngineService.scala
(https://github.com/felipehummel/elasticsearch-lang-scala/blob/master/src/main/scala/org/elasticsearch/script/scala/ScalaScriptEngineService.scala)
My intention is to have a simple and easy API to access parameters, fields,
and _source from the script, without the need for casts.

An example scala script can be (quotes should be escaped when necessary):

( doubleParam("param1") * log(longField("a")) ) / doubleParam("param2")

It has double|float|longParam and long|float|doubleField methods to avoid
casting inside the script.
All scala.math._ functions are imported, so they can be used directly.
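
To illustrate how such typed helpers keep casts out of the script, here is a small self-contained sketch. `ScriptContext` is a hypothetical stand-in backed by plain Maps; the real ScalaSearchScript in the plugin reads from ES structures and may look quite different.

```scala
import scala.math._

// Hypothetical base class: params and doc fields are modelled as plain Maps here.
abstract class ScriptContext(params: Map[String, Any], doc: Map[String, Any]) {
  // Accept any boxed numeric value so the script body never has to cast.
  private def number(m: Map[String, Any], kind: String, name: String): Number =
    m.get(name) match {
      case Some(n: Number) => n
      case Some(other) =>
        throw new IllegalArgumentException(s"$kind '$name' is not numeric: $other")
      case None =>
        throw new NoSuchElementException(s"missing $kind '$name'")
    }

  def doubleParam(name: String): Double = number(params, "param", name).doubleValue
  def floatParam(name: String): Float = number(params, "param", name).floatValue
  def longParam(name: String): Long = number(params, "param", name).longValue

  def doubleField(name: String): Double = number(doc, "field", name).doubleValue
  def floatField(name: String): Float = number(doc, "field", name).floatValue
  def longField(name: String): Long = number(doc, "field", name).longValue

  // The user script becomes the body of this method.
  def run(): Double
}

object HelperDemo {
  // Evaluates the example script from the post against one document.
  def score(params: Map[String, Any], doc: Map[String, Any]): Double =
    new ScriptContext(params, doc) {
      // .toDouble is explicit here to avoid relying on Long-to-Double widening.
      def run(): Double =
        (doubleParam("param1") * log(longField("a").toDouble)) / doubleParam("param2")
    }.run()

  def main(args: Array[String]): Unit =
    println(score(Map("param1" -> 2.0, "param2" -> 4.0), Map("a" -> 100L)))
}
```

Each helper call still does a map lookup and a type check, which is the kind of per-document overhead mentioned further down.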

Main advantage: you get the same (or almost the same) performance as a
native Java script, but with the flexibility of changing and creating
scripts at will, without restarting ES or dealing with deploying .jars.

Performance:
I've been testing on a synthetic dataset of 6.5 million documents with a
simple structure (generateDataset.sh is inside the repo).
Using a match_all query to force the script to run on all documents, the
MVEL script takes around 2 seconds, native Java scripts around 350-400 ms
and "native" Scala scripts around 420-480 ms (the same logic is implemented
in all 3 scripts, of course).
The index mapping/configuration is the default, with 5 shards but no
replicas. This is in no way a scientific experiment, just a micro benchmark
running on my MacBook Pro (4 virtual cores and 8 GB RAM).

And finally, installing it:

git clone git@github.com:felipehummel/elasticsearch-lang-scala.git

cd elasticsearch-lang-scala
sbt assembly
# first create the plugins/lang-scala directory
cp target/ScalaScriptsPlugin-assembly-0.1.0-SNAPSHOT.jar ES_DIRECTORY/plugins/lang-scala

Using it is just a matter of setting "lang" : "scala" in your customScore
queries.

Some drawbacks:

  • The first query using the scala type after the ES node starts can be
    very slow. On my notebook it takes ~8 to 12 seconds (the 5 shards may
    have something to do with it; it seems the compiler is initialized 5
    times).
  • The second time you submit the same query it runs as a regular JVM/Java
    object, so there is no extra penalty. You may need a few queries for
    the JIT to kick in.
  • After the first query, every new, unseen script you submit takes some
    time to compile. On my notebook it ranges from 2 to 5 seconds.
  • The script "reprocesses" the parameters for every document (every
    execution of runAsDouble()). In a native script (for example here:
    https://github.com/felipehummel/elasticsearch-lang-scala/blob/master/src/main/scala/org/elasticsearch/script/scala/ExampleNativeScript.scala#L18-26)
    you can just create member fields holding the processed/cast parameters
    and use those fields directly inside runAsDouble. When passing a script
    you can't (easily) say how you want your parameters to be
    processed/cast. This causes a noticeable overhead. If anyone has any
    ideas to remove this limitation, don't be shy =)
  • To avoid casting and handling doc fields inside the script, the
    ScalaSearchScript class implements a few helper methods that are
    available inside the script. These also come with an overhead (the
    times reported above were measured using these helper methods).
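
The difference between reprocessing parameters per document and holding them in member fields can be sketched as follows. `DoubleScript` is a hypothetical minimal interface standing in for the ES search-script base class, and both implementations are illustrations rather than code from the plugin.

```scala
// Hypothetical minimal script interface; stands in for the ES search-script base class.
trait DoubleScript {
  def runAsDouble(fieldValue: Long): Double
}

// Per-document processing: every call pays for map lookups, type checks and unboxing.
final class LookupEveryTime(params: Map[String, Any]) extends DoubleScript {
  private def doubleParam(name: String): Double =
    params(name).asInstanceOf[Number].doubleValue

  def runAsDouble(fieldValue: Long): Double =
    (doubleParam("param1") * math.log(fieldValue.toDouble)) / doubleParam("param2")
}

// Native-script style: parameters are converted once at construction time
// and kept in primitive member fields that runAsDouble reads directly.
final class ParsedOnce(params: Map[String, Any]) extends DoubleScript {
  private val param1: Double = params("param1").asInstanceOf[Number].doubleValue
  private val param2: Double = params("param2").asInstanceOf[Number].doubleValue

  def runAsDouble(fieldValue: Long): Double =
    (param1 * math.log(fieldValue.toDouble)) / param2
}

object HoistDemo {
  def main(args: Array[String]): Unit = {
    val params = Map[String, Any]("param1" -> 2.0, "param2" -> 4.0)
    val slow = new LookupEveryTime(params)
    val fast = new ParsedOnce(params)
    // Both compute the same score; only the per-call work differs.
    println(slow.runAsDouble(100L) == fast.runAsDouble(100L))
  }
}
```

One possible direction, offered only as an idea: the generated wrapper class could take the parameters in its constructor, so that any helper results the script needs can be bound to fields once per script instance instead of once per document.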

I have a few questions:

  • How can I log something from inside ScalaScriptEngineService?
  • By using AbstractDoubleSearchScript and making ScalaScript return Double,
    I'm forcing the user script to return a Double. What are the implications
    performance-wise? I guess I could try compiling with a Float return
    type, then Double, then Long, and use whichever compiles first.
  • What does 'plugin -install' expect in a repository for it to be able to
    install? Maven and a specific goal?
  • Are date fields numeric? Are they stored as a Long timestamp,
    milliseconds since the epoch? In other words, how is a field converted
    from DocFieldData to a date object (MutableDateTime for example, as MVEL
    does)?
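
The "try Float, then Double, then Long" idea from the second question could be sketched like this. `firstSuccess` is a generic helper; `fakeCompile` is a hypothetical placeholder that only pretends to compile, since the real step would call into the runtime compiler.

```scala
import scala.util.{Failure, Success, Try}

object ReturnTypeFallback {
  // Tries each candidate in order and returns the first one whose attempt succeeds.
  // The iterator is lazy, so candidates after the first success are never attempted.
  def firstSuccess[A, B](candidates: List[A])(attempt: A => Try[B]): Option[(A, B)] =
    candidates.iterator
      .map(c => c -> attempt(c))
      .collectFirst { case (c, Success(b)) => (c, b) }

  // Placeholder for a real compile step: pretends only a Double-returning
  // wrapper type-checks for this script.
  def fakeCompile(source: String, returnType: String): Try[String] =
    if (returnType == "Double") Success(s"compiled [$source] as $returnType")
    else Failure(new IllegalArgumentException(s"type mismatch: expected $returnType"))

  def main(args: Array[String]): Unit = {
    val result = firstSuccess(List("Float", "Double", "Long")) { tpe =>
      fakeCompile("doubleParam(\"param1\") * 2", tpe)
    }
    println(result.map(_._1))
  }
}
```

Given the compile times of a few seconds reported above, each failed attempt would presumably add a similar delay to the first use of a script, so the order of candidates matters.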

And a few observations:

  • Executable/execute-related stuff is being ignored for now.
  • The Scala compiler complains when I do
    'docLookup.numeric(str).getDoubleValue'. It evaluates
    'docLookup.numeric(str)' to 'Nothing' instead of 'NumericDocFieldData'.
  • The time for the Java scripts is ~350 ms when no Scala script has run
    beforehand. If I run the Java scripts after the Scala scripts, the Java
    time goes up to ~400-430 ms, which is about the same as the Scala script.
    I guess some JIT optimization is turned off when it finds more than one
    implementer of AbstractDoubleSearchScript. If I run the Scala scripts
    first (without the Java native script), there is no difference.
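
The 'Nothing' inference in the second observation is what Scala does when a method is generic only in its return type and the call site gives no expected type. The sketch below reproduces that with hypothetical stand-in classes; it assumes the real numeric method is declared generic in this way, which the post does not confirm.

```scala
// Hypothetical stand-ins for the ES doc-lookup classes; real signatures may differ.
class DocFieldData
class NumericDocFieldData(value: Double) extends DocFieldData {
  def getDoubleValue: Double = value
}

class DocLookup(fields: Map[String, DocFieldData]) {
  // Generic in the return type only, like a Java
  // `<T extends DocFieldData> T numeric(String name)`.
  def numeric[T <: DocFieldData](name: String): T = fields(name).asInstanceOf[T]
}

object InferenceDemo {
  def read(doc: DocLookup, name: String): Double = {
    // doc.numeric(name).getDoubleValue
    // ^ does not compile: with nothing to constrain T it is inferred as Nothing.

    // Option 1: pass the type argument explicitly.
    val viaTypeArg = doc.numeric[NumericDocFieldData](name).getDoubleValue

    // Option 2: give the result an expected type so T is inferred from it.
    val typed: NumericDocFieldData = doc.numeric(name)

    assert(viaTypeArg == typed.getDoubleValue)
    viaTypeArg
  }

  def main(args: Array[String]): Unit = {
    val doc = new DocLookup(Map("a" -> new NumericDocFieldData(42.0)))
    println(read(doc, "a"))
  }
}
```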

Any ideas, suggestions? :)

Felipe Hummel

--


(BillyEm) #2

Are you proposing a third-party plugin, or a plugin that the ES community
would have to maintain?

thx

On Monday, October 22, 2012 1:32:47 PM UTC-4, Felipe Hummel wrote:


--


(Felipe Hummel) #3

If the community is interested in it, I'd be happy to "donate" the code and
hopefully continue to contribute.
Otherwise I'll just continue to develop it as my side project, and I would
welcome any help.

I have no further interest in it besides making it a useful plugin for the
developers who use ES.

Thanks

Felipe Hummel

On Monday, October 22, 2012 4:05:28 PM UTC-4, BillyEm wrote:


--

