Scala scripting - Native performance with script flexibility?

Felipe_Hummel · October 22, 2012, 5:32pm

Hey guys, I decided to take a couple of days to try out an idea.
Scala can compile Scala code at runtime, from inside your own code, and use
the result. More specifically, the twitter-util library
(https://github.com/twitter/util) has an Eval module that makes compiling
code very easy. I've heard that you can do similar things in Java.
So what if we could use Scala as just another scripting language in ES? We
would send Scala code through the 'script' field in customScore queries, and
an ES plugin would compile the code to a regular object and use it in the
scoring.

I implemented the idea, maybe with some issues, but it serves as a first
proof of concept. The code is here:
https://github.com/felipehummel/elasticsearch-lang-scala
The relevant part is just this file: ScalaScriptEngineService.scala
https://github.com/felipehummel/elasticsearch-lang-scala/blob/master/src/main/scala/org/elasticsearch/script/scala/ScalaScriptEngineService.scala
My intention is to have a simple and easy API to access parameters, fields
and _source from the script, without the need to do casts.

An example scala script can be (quotes should be escaped when necessary):

( doubleParam("param1") * log(longField("a")) ) / doubleParam("param2")

It has doubleParam, floatParam and longParam methods, plus longField,
floatField and doubleField methods, in order to avoid casting inside the
script.
All scala.math._ functions are imported, so they can be used directly.
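
To make the helper idea concrete, here is a minimal sketch of how such typed accessors could look. It is not the plugin's actual code: SketchScript, the plain Map inputs and the run() method are hypothetical stand-ins for the real Elasticsearch classes, and only the accessor names come from the description above.

```scala
import scala.math._

// Hypothetical stand-in for the plugin's script base class. It only models
// the typed accessors described above, not the real Elasticsearch API.
abstract class SketchScript(params: Map[String, Any], doc: Map[String, Any]) {
  // Look a value up and insist that it is numeric, so scripts need no casts.
  private def number(values: Map[String, Any], kind: String, name: String): Number =
    values.get(name) match {
      case Some(n: Number) => n
      case Some(other) =>
        throw new IllegalArgumentException(s"$kind '$name' is not numeric: $other")
      case None =>
        throw new NoSuchElementException(s"missing $kind '$name'")
    }

  def doubleParam(name: String): Double = number(params, "param", name).doubleValue
  def floatParam(name: String): Float = number(params, "param", name).floatValue
  def longParam(name: String): Long = number(params, "param", name).longValue

  def doubleField(name: String): Double = number(doc, "field", name).doubleValue
  def floatField(name: String): Float = number(doc, "field", name).floatValue
  def longField(name: String): Long = number(doc, "field", name).longValue

  // The user-supplied expression would end up as the body of this method.
  def run(): Double
}

object SketchScriptDemo {
  // Evaluates the example expression from this post against one document.
  def score(params: Map[String, Any], doc: Map[String, Any]): Double =
    new SketchScript(params, doc) {
      // .toDouble is explicit only to avoid a Long-to-Double widening warning.
      def run(): Double =
        (doubleParam("param1") * log(longField("a").toDouble)) / doubleParam("param2")
    }.run()
}
```

Calling SketchScriptDemo.score(Map("param1" -> 2.0, "param2" -> 4.0), Map("a" -> 10L)) evaluates the expression for a document whose field "a" is 10.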

Main advantage: you can achieve the same (or almost the same) performance as
a native Java script, but with the flexibility of changing and creating
scripts at will, without restarting ES or deploying .jars.

Performance:
I've been testing on a synthetic dataset of 6.5 million documents with a
simple structure (generateDataset.sh is in the repo).
Using a match_all query to force the script to run on all documents, the
MVEL script takes around 2 seconds, the native Java script around 350-400 ms
and the "native" Scala script around 420-480 ms (the same computation is
implemented in all 3 scripts, of course).
The index mapping/configuration is the default, with 5 shards but no
replicas. This is in no way a scientific experiment, just a micro benchmark
running on my MacBook Pro (4 virtual cores and 8GB RAM).

And finally, installing it:

git clone git@github.com:felipehummel/elasticsearch-lang-scala.git
cd elasticsearch-lang-scala
sbt assembly
# first create the plugins/lang-scala directory
cp target/ScalaScriptsPlugin-assembly-0.1.0-SNAPSHOT.jar ES_DIRECTORY/plugins/lang-scala

Using it is just a matter of setting "lang" : "scala" in your customScore
queries.

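As a rough illustration of what such a request could look like, the sketch below builds a custom_score body with the script quotes escaped. The body layout (query, script, lang, params under custom_score) is my assumption about the query DSL of that ES generation, and CustomScoreRequest is a hypothetical helper, not part of the plugin.

```scala
object CustomScoreRequest {
  // Escape backslashes and double quotes so the script fits in a JSON string.
  // Control characters are not handled, to keep the sketch short.
  def jsonString(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  // Build the request body. Params are limited to doubles in this sketch.
  def body(script: String, params: Map[String, Double]): String = {
    val paramsJson = params
      .map { case (name, value) => s"${jsonString(name)}: $value" }
      .mkString("{", ", ", "}")
    val scriptJson = jsonString(script)
    s"""{"query": {"custom_score": {"query": {"match_all": {}}, "script": $scriptJson, "lang": "scala", "params": $paramsJson}}}"""
  }
}
```

The resulting string would be sent as the body of a search request; the only plugin-specific part is the "lang" value.
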
Some drawbacks:

The first query (after the ES node starts) using the scala type can be
very slow. On my laptop it takes ~8 to 12 seconds (the 5 shards may have
something to do with it; it seems the compiler is initialized 5 times).

The second time you submit the same query it will use a regular JVM/Java
object to run, so there is no extra penalty. You may need a few queries for
the JIT to kick in.

After the first query, every time you submit a new, unseen query it will
take some time to compile. On my laptop, this ranges from 2 to 5 seconds.

The script "reprocesses" the parameters for every document (every
execution of runAsDouble()). In a native script, for example here:
https://github.com/felipehummel/elasticsearch-lang-scala/blob/master/src/main/scala/org/elasticsearch/script/scala/ExampleNativeScript.scala#L18-26
you can just create member fields holding the processed/cast parameters
and use the fields directly inside runAsDouble. When passing a script, you
can't (easily) say how you want your parameters to be processed/cast. This
causes a noticeable overhead. If anyone has any ideas to remove this
limitation, don't be shy =)

To avoid casting and handling doc fields inside the script, the
ScalaSearchScript class implements a few helper methods available inside
the script. But these also come with an overhead (the times reported above
were measured while using these helper methods).
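
The parameter drawback can be sketched as two toy classes: one that converts a parameter on every call, and one that converts it once at construction, as a native script can. All names here (ParamHoisting, PerDocumentLookup, HoistedLookup, the "factor" parameter) are hypothetical and only illustrate the difference; they are not the plugin's classes.

```scala
object ParamHoisting {
  type Params = Map[String, Any]

  // Convert one untyped parameter to Double, failing loudly if it is not numeric.
  private def asDouble(params: Params, name: String): Double =
    params(name) match {
      case n: Number => n.doubleValue
      case other =>
        throw new IllegalArgumentException(s"param '$name' is not numeric: $other")
    }

  // Script style: the parameter is looked up and converted on every call,
  // that is, once per scored document.
  final class PerDocumentLookup(params: Params) {
    def runAsDouble(fieldValue: Double): Double =
      asDouble(params, "factor") * fieldValue
  }

  // Native style: the parameter is converted once, when the script object
  // is created, and the member field is reused for every document.
  final class HoistedLookup(params: Params) {
    private val factor: Double = asDouble(params, "factor")
    def runAsDouble(fieldValue: Double): Double = factor * fieldValue
  }
}
```

Both classes return the same scores; the second one simply moves the map lookup and the conversion out of the per-document path.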

I have a few questions:

How can I log something from inside ScalaScriptEngineService?

By using AbstractDoubleSearchScript and making ScalaScript return Double,
I'm forcing the user script to return a Double. What are the implications
performance-wise? I guess I could try to compile the script as returning
Float, then Double, then Long, and use whichever compiles first.

What does 'plugin -install' expect in a repository to be able to install
from it? Maven and a specific goal?

Are date fields numeric? Are they stored as a Long timestamp, in
milliseconds since the epoch? In other words, how is a field converted from
DocFieldData to a date object (MutableDateTime, for example, as MVEL does)?
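
The "try Float, then Double, then Long" idea could be sketched like this. The compileAs function is a hypothetical stand-in for the real compilation step (for example a wrapper around Eval), so this only shows the fallback logic, not actual compilation.

```scala
import scala.util.Try

object ReturnTypeFallback {
  // Try each candidate return type in order and keep the first one for which
  // compilation succeeds. `compileAs` takes (source, returnType) and is a
  // hypothetical placeholder for the real compiler call.
  def firstCompiling[A](source: String, returnTypes: List[String])(
      compileAs: (String, String) => Try[A]): Option[(String, A)] =
    returnTypes.iterator
      .map(tpe => compileAs(source, tpe).toOption.map(compiled => (tpe, compiled)))
      .collectFirst { case Some(hit) => hit }
}
```

Because the iterator is lazy, later return types are not attempted once one succeeds, which matters when each failed attempt costs a full compiler run.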

And a few observations:

Executable/execute-related functionality is being ignored for now.

The Scala compiler complains when I do
'docLookup.numeric(str).getDoubleValue'. It infers the type of
'docLookup.numeric(str)' as 'Nothing' instead of 'NumericDocFieldData'.

The Java scripts take ~350 ms when no Scala script has run beforehand. If
I run the Java scripts after the Scala scripts, the Java time goes up to
~400-430 ms, which is about the same as the Scala script. I guess some JIT
optimization is turned off when the JVM finds more than one implementer of
AbstractDoubleSearchScript. If I run the Scala scripts first (without the
Java native script), there is no difference.

Any ideas, suggestions?

Felipe Hummel

--

BillyEm · October 22, 2012, 8:05pm

Are you proposing a third-party plugin, or a plugin that the ES community
would have to maintain?

thx


--

Felipe_Hummel · October 22, 2012, 11:25pm

If the community has interest in it, I'd be happy to "donate" the code and
hopefully continue to contribute.
Otherwise I'll just continue to develop it as a side project and would
welcome any help.

I have no further interest in it besides making it a useful plugin for
developers who use ES.

Thanks

Felipe Hummel


--
