I've tried a few times to write a scoring function that scales into the millions or billions of documents. At the moment, mine don't even scale within the millions.
The latest one was a simple scoring script that looked something like this:
if (doc[param_field].value == null) {
    return 0;
} else {
    BigInteger left = new BigInteger(doc[param_field].value, 16);
    BigInteger right = new BigInteger(param_hash, 16);
    return 400 - left.xor(right).bitCount();
}
This might seem a little naive as a script (I'm kinda new to this), but as you can see, it's essentially calculating the Hamming distance between a given parameter and a document field, and then scoring accordingly.
The "400" gives you some idea of how long these fields are expected to be (not very long, remembering that this is a bit count).
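For reference, the script body above boils down to this standalone Java (the hex hashes in the example are made up; the 400 ceiling is taken from the post):

```java
import java.math.BigInteger;

public class HammingDemo {
    // Mirrors the scoring script: XOR the two hex-encoded hashes and count the
    // differing bits, then subtract from the 400-bit ceiling.
    static int score(String docHash, String paramHash) {
        BigInteger left = new BigInteger(docHash, 16);
        BigInteger right = new BigInteger(paramHash, 16);
        return 400 - left.xor(right).bitCount();
    }

    public static void main(String[] args) {
        // "ff" and "fe" differ in exactly one bit, so the score is 400 - 1.
        System.out.println(score("ff", "fe")); // prints 399
        // Identical hashes get the maximum score.
        System.out.println(score("abc", "abc")); // prints 400
    }
}
```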
At the moment this takes far too long to run (~4s) on a document set of 17k. In my experience with my previous attempt at a scoring script (that one calculated Euclidean distance), the execution time grows as the document base grows.
My question is really around the scalability of scoring function scripts.
How can I write one that scales?
Why is mine so slow?
Should I expect it to be slow?
If anyone could point me towards any resources advising on scoring script optimisation etc that would be great!
Thanks, Ivan. That is true for v2.x, but unfortunately not for v1.7.3. At least, it's not in the docs for my version.
I could upgrade, but I want to be sure it'll be worth it.
To answer your question, param_hash is a hex encoded binary string, much like my param_field value. The strings are actually phashes or dhashes of images (the docs in the es index represent images), and in comparing hamming distance I'm comparing image similarity.
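Since p-hashes and d-hashes are fixed-length, one way to avoid allocating two BigIntegers per document (an assumption about where the time goes, not something profiled here) is to pre-parse the hashes into 64-bit words once and compare chunks with Long.bitCount. A sketch, assuming the hex strings are a multiple of 16 characters:

```java
public class FastHamming {
    // Parse a hex string into 64-bit words.
    // Assumes the string length is a multiple of 16 hex characters.
    static long[] toWords(String hex) {
        long[] words = new long[hex.length() / 16];
        for (int i = 0; i < words.length; i++) {
            words[i] = Long.parseUnsignedLong(hex.substring(i * 16, (i + 1) * 16), 16);
        }
        return words;
    }

    // XOR word-by-word and count the differing bits.
    static int hamming(long[] a, long[] b) {
        int distance = 0;
        for (int i = 0; i < a.length; i++) {
            distance += Long.bitCount(a[i] ^ b[i]);
        }
        return distance;
    }
}
```

In a script context you would parse the query-side hash once (it is the same for every document) rather than on every invocation.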
Where is param_hash coming from? I cannot tell from your script. You should
use stored fields and not source (which is what it appears that you are
doing). Are doc_values enabled on those fields?
Re: scripting in 1.x - didn't see that there! The docs for 2.2 are definitely clearer in this area. Do the docs still apply to 1.x? It's easier to follow instructions when doing something new for the first time...
To answer your question re:param_hash: it's a parameter that I pass in with the scoring script info, for example:
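Something along these lines (script name, field name, and hash value below are placeholders, not my real values):

```json
{
  "query": {
    "function_score": {
      "script_score": {
        "script": "hamming_distance",
        "params": {
          "param_field": "phash",
          "param_hash": "a1b2c3d4e5f60789"
        }
      }
    }
  }
}
```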
I can't store this parameter in the document, because it IS a query-time parameter.
Re: doc_values being enabled, I don't believe they are. I thought I had set these fields as "not_analyzed", which would make doc_values default to true, but having just checked my mappings I can see that they're not set like that after all. So no, doc_values are not enabled. But doc_values are slower than in-memory fielddata, right?
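For the record, enabling doc values on a 1.x string field looks something like this (type and field names are placeholders):

```json
{
  "mappings": {
    "image": {
      "properties": {
        "phash": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true
        }
      }
    }
  }
}
```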
OK, so I've written the native script, and installed it:

sudo bin/plugin -u file:///usr/local/bin/elasticsearch-1.7.3/plugins/cameraforensics/cf-elasticsearch-plugins-hammingdistance-0.1.0.jar -i hamming_distance
It didn't show under the list of loaded plugins, so I restarted, and there it was:

$ curl 'http://localhost:9200/_cat/plugins?v'
name                     component        version type url
George Washington Bridge hamming_distance NA      j
And it's there in the logs:

[2016-02-08 16:38:15,745][INFO ][plugins ] [George Washington Bridge] loaded [mapper-attachments, marvel, knapsack-1.7.2.0-954d066, **hamming_distance**, cloud-aws], sites [marvel, bigdesk]
But when I try to use it, I get an error:

ElasticsearchIllegalArgumentException[Native script [hamming_distance] not found]
Any ideas?
This is what my es-plugin.properties looks like:
plugin=com.cameraforensics.elasticsearch.plugins.HammingDistancePlugin
name=hamming_distance
version=0.1.0
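In 1.x, a native script also has to be registered from the plugin class itself, not just declared in es-plugin.properties; if that hook is missing, the jar loads but Elasticsearch never learns the script name, which matches the "not found" error above. A sketch of what that registration looks like (it won't compile standalone without the Elasticsearch 1.x jars, and the factory class name is assumed to mirror the properties file):

```java
import org.elasticsearch.plugins.AbstractPlugin;
import org.elasticsearch.script.ScriptModule;

public class HammingDistancePlugin extends AbstractPlugin {
    @Override
    public String name() {
        return "hamming_distance";
    }

    @Override
    public String description() {
        return "Hamming distance native scoring script";
    }

    // Without this hook, the plugin loads but the script name is never
    // registered, producing "Native script [hamming_distance] not found".
    public void onModule(ScriptModule module) {
        module.registerScript("hamming_distance", HammingDistanceScriptFactory.class);
    }
}
```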
Sorry I did not catch your response earlier, but I'm glad you found it. Doc
values are not enabled by default in 1.x. They are worth giving a shot: they
might be slower to load (and index), but can be faster depending on your
types of lookup.