Hey guys, I decided to take a couple of days to try out an idea.
Scala can compile Scala code at runtime and use it from within your program. More
specifically, the twitter-util library (https://github.com/twitter/util) and its
Eval module make compiling code very easy. I've heard you can do similar things
in Java.
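For context, here is a minimal sketch of what Eval gives you (assuming
twitter-util is on the classpath; the example itself is just for illustration):

import com.twitter.util.Eval

object EvalDemo extends App {
  // Compile a string of Scala source at runtime; the type parameter
  // tells Eval what to cast the compiled result to.
  val eval = new Eval()
  val double = eval[Int => Int]("(i: Int) => i * 2")
  println(double(21)) // prints 42
}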
So what if we could use Scala as just another scripting language in ES? We
would send Scala code through the 'script' field in customScore queries, and an
ES plugin would compile the code into a regular object and use it for scoring.
I implemented the idea, maybe with some issues, but it serves as a first
proof of concept. The code is here:
https://github.com/felipehummel/elasticsearch-lang-scala
The relevant part is just this file: ScalaScriptEngineService.scala
https://github.com/felipehummel/elasticsearch-lang-scala/blob/master/src/main/scala/org/elasticsearch/script/scala/ScalaScriptEngineService.scala
My intention is to have a simple and easy API for accessing parameters, fields,
and _source from the script, without the need for casts.
An example scala script can be (quotes should be escaped when necessary):
( doubleParam("param1") * log(longField("a")) ) / doubleParam("param2")
It has doubleParam/floatParam/longParam and longField/floatField/doubleField
methods to avoid casting inside the script.
All scala.math._ functions are statically imported so they can be used directly.
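To make that concrete, the helpers boil down to something like the following
(a simplified, hypothetical sketch for illustration, not the exact code in the
repo; the params/fieldValue accessors are my assumption):

// Simplified sketch of the helper layer the script body sees.
abstract class ScalaSearchScriptSketch {
  // Assumed accessors: query-time parameters and the current document's fields.
  protected def params: Map[String, Any]
  protected def fieldValue(name: String): Any

  // Typed accessors so the script never has to cast by hand.
  def doubleParam(name: String): Double = params(name).asInstanceOf[Number].doubleValue
  def longParam(name: String): Long     = params(name).asInstanceOf[Number].longValue
  def floatParam(name: String): Float   = params(name).asInstanceOf[Number].floatValue
  def doubleField(name: String): Double = fieldValue(name).asInstanceOf[Number].doubleValue
  def longField(name: String): Long     = fieldValue(name).asInstanceOf[Number].longValue
  def floatField(name: String): Float   = fieldValue(name).asInstanceOf[Number].floatValue
}

With scala.math._ imported into the generated class, the example script above
is valid Scala as-is.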
Main advantage: you get the same (or almost the same) performance as a native
Java script, but with the flexibility of changing and creating new scripts at
will, without restarting ES or dealing with deploying .jars.
Performance:
I've been testing on a 6.5-million-document synthetic dataset with a simple
structure (generateDataset.sh is in the repo).
Using a match_all query to force the script to run on all documents, the MVEL
script takes around 2 seconds, native Java scripts around 350-400 ms, and
"native" Scala scripts around 420-480 ms (the same logic is implemented in
all 3 scripts, of course).
The index mapping/configuration is the default with 5 shards and no replicas.
This is in no way a scientific experiment, just a micro benchmark running on
my MacBook Pro (4 virtual cores and 8 GB RAM).
And finally, installing it:
git clone git@github.com:felipehummel/elasticsearch-lang-scala.git
cd elasticsearch-lang-scala
sbt assembly
#first create the plugins/lang-scala directories
cp target/ScalaScriptsPlugin-assembly-0.1.0-SNAPSHOT.jar ES_DIRECTORY/plugins/lang-scala
Using it is just a matter of choosing "lang" : "scala" in your customScore
queries.
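For example, the request body would look something like this (shown here as a
Scala string so the quote escaping is visible; the exact custom_score syntax
may vary a bit with your ES version, and the param values are arbitrary):

// Hypothetical request body using the example script from above.
val body = """
{
  "query": {
    "custom_score": {
      "query": { "match_all": {} },
      "lang": "scala",
      "script": "( doubleParam(\"param1\") * log(longField(\"a\")) ) / doubleParam(\"param2\")",
      "params": { "param1": 2.0, "param2": 3.5 }
    }
  }
}
"""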
Some drawbacks:
- The first query that uses a scala script (after the ES node starts) can be
very slow. On my notebook it takes ~8 to 12 seconds (the 5 shards may have
something to do with it; it seems the compiler is initialized 5 times).
- The second time you submit the same query it runs as a regular JVM/Java
object, so there is no extra penalty. You may need a few queries for the JIT
to kick in.
- After the first query, every new, unseen query still takes some time to
compile. On my notebook this ranges from 2 to 5 seconds.
- The script "reprocesses" the parameters for every document (every execution
of runAsDouble()). For example, here
https://github.com/felipehummel/elasticsearch-lang-scala/blob/master/src/main/scala/org/elasticsearch/script/scala/ExampleNativeScript.scala#L18-26
in a native script you can just create member fields holding the
processed/casted parameters and use those fields directly inside runAsDouble
(see the sketch after this list). When passing a script as a string you can't
(easily) say how you want your parameters to be processed/casted, which causes
a reasonable overhead. If anyone has any ideas to remove this limitation,
don't be shy =)
- To avoid casting and handling doc fields inside the script, the
ScalaSearchScript class implements a few helper methods available inside the
script. These also come with an overhead (the times reported above were
measured while using these helper methods).
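To illustrate the parameter point, this is roughly the pattern a native script
can use (a hypothetical sketch modeled on that pattern, not the actual
ExampleNativeScript code):

import org.elasticsearch.script.AbstractDoubleSearchScript

// Sketch: parameters are casted once, when the script instance is created,
// instead of on every runAsDouble() call.
class CachedParamsScript(params: java.util.Map[String, AnyRef])
    extends AbstractDoubleSearchScript {

  // Done once per script instance.
  private val param1 = params.get("param1").asInstanceOf[Number].doubleValue
  private val param2 = params.get("param2").asInstanceOf[Number].doubleValue

  // Called once per document; no parameter casting here.
  override def runAsDouble(): Double = param1 / param2
}

A dynamically compiled script string has no obvious place to express that
one-time setup, which is where the overhead comes from.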
I have a few questions:
- How can I log something from inside ScalaScriptEngineService?
- By using AbstractDoubleSearchScript and making ScalaScript return Double,
I'm forcing the user script to return a Double. What are the implications
performance-wise? I guess I could try compiling the script as returning Float,
then Double, then Long, and use whichever compiles.
- What does 'plugin -install' expect from a repository in order to install it?
Maven and a specific goal?
- Are date fields numeric? Are they stored as Long timestamps, i.e.
milliseconds since epoch? In other words, how is a field converted from
DocFieldData to a date object (MutableDateTime, for example, as MVEL does)?
And a few observations:
- Executable/execute related stuff is being ignored for now.
- The Scala compiler complains when I do 'docLookup.numeric(str).getDoubleValue'.
It infers 'docLookup.numeric(str)' as 'Nothing' instead of 'NumericDocFieldData'.
- The times for the Java scripts are ~350 ms without running any Scala script
beforehand. If I run the Java scripts after the Scala scripts, the Java time
goes up to ~400-430 ms, which is the same as the Scala script. I guess some
JIT optimization is turned off when it finds more than one implementer of
AbstractDoubleSearchScript. If I run the Scala scripts first (without the Java
native script), there is no difference.
Any ideas, suggestions?
Felipe Hummel