Scoring Function Scalability - v1.7.3


(Nathan Trevivian) #1

Hi all,

I've tried a few times to write a scoring function that scales to millions or billions of documents. At the moment, they don't even scale into the millions.

The latest one was a simple scoring script that looked something like this:

if(doc[param_field].value == null){
    return 0;
} else {
    BigInteger left = new BigInteger(doc[param_field].value, 16);
    BigInteger right = new BigInteger(param_hash, 16);
    return 400 - left.xor(right).bitCount();
}

This might seem a little bit naive as a script (I'm kinda new to this), but as you can see: It's essentially trying to calculate the hamming distance between a given parameter and a document field, and then score appropriately.

The "400" bit gives you some idea as to how long these fields are expected to be (not very long, remembering that this is a bit count).
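For readers who want to try the scoring logic outside Elasticsearch, here is a minimal standalone Java sketch of the same calculation. The class name HammingScore and the constant MAX_SCORE are made up for illustration; the sketch assumes both values are hex strings and that 400 is larger than the hash length in bits, as the script above does.

```java
import java.math.BigInteger;

// Hypothetical helper mirroring the script above; not part of any Elasticsearch API.
final class HammingScore {
    // Upper bound taken from the script; assumed to exceed the hash length in bits.
    static final int MAX_SCORE = 400;

    // Returns 0 for a missing field, otherwise MAX_SCORE minus the number of differing bits.
    static int score(String fieldHex, String queryHex) {
        if (fieldHex == null) {
            return 0;
        }
        BigInteger left = new BigInteger(fieldHex, 16);
        BigInteger right = new BigInteger(queryHex, 16);
        // XOR leaves a 1 wherever the two hashes differ; bitCount counts those positions.
        return MAX_SCORE - left.xor(right).bitCount();
    }

    public static void main(String[] args) {
        // 0xff and 0x0f differ in the top four bits, so this prints 396.
        System.out.println(score("ff", "0f"));
    }
}
```

Identical hashes score 400, and each differing bit lowers the score by one.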

At the moment this takes far too long to run (~4s) on a document set of 17k. In my experience with my previous attempt at a scoring script (that one was calculating euclidean distance), the execution time gets larger as the document base grows.

My question is really around the scalability of scoring function scripts.

  • How can I write one that scales?
  • Why is mine so slow?
  • Should I expect it to be slow?

If anyone could point me towards any resources advising on scoring script optimisation etc that would be great!

Thanks for any help!


(Ivan Brusic) #2

You can compile scripts and have them loaded in a jar, but most scripts are
cached anyways. What is param_hash?

Ivan


(Nathan Trevivian) #3

Thanks, Ivan. That is true for v 2, but not for v1.7.3 unfortunately. At least, it doesn't have it in the docs for my version.

I could upgrade, but I want to be sure it'll be worth it.

To answer your question, param_hash is a hex encoded binary string, much like my param_field value. The strings are actually phashes or dhashes of images (the docs in the es index represent images), and in comparing hamming distance I'm comparing image similarity.

Thanks again for your help.


(Ivan Brusic) #4

You can definitely use Java-based scripts in 1.x:

https://www.elastic.co/guide/en/elasticsearch/reference/1.7/modules-scripting.html#native-java-scripts

Where is param_hash coming from? I cannot tell from your script. You should
use stored fields and not source (which is what it appears that you are
doing). Are doc_values enabled on those fields?

Ivan


(Nathan Trevivian) #5

Re: scripting in 1.x - didn't see that there! The docs for 2.2 are definitely clearer in this area. Do the docs still apply to 1.x? It's easier to follow instructions when doing something new for the first time...

To answer your question re:param_hash: it's a parameter that I pass in with the scoring script info, for example:

curl -XPOST 'http://localhost:9200/scf/_search?pretty' -d '{
  "query": {
    "function_score": {
      "query":{
          ...
      },
      "functions": [
        {
          "script_score": {
            "script_file": "hamming_distance",
            "lang" : "groovy",
            "params": {
              "param_hash": "b212f1007190e430d634d6324a3a4b3a4d6a6758403c455841cc11cd30c7325b",
              "param_field":"dhash"
            }
          }
        }
      ]
    }
  }
}'

I can't store this parameter in the document, because it IS a query-time parameter.

Re: doc_values being enabled, I don't believe they are. I thought I had set these fields as being "not_analyzed", which would make doc_values set to true by default, but as I've just checked my mappings I can see that they're not set like that after all. So no, doc_values are not enabled. But doc_values are slower than in memory fielddata, right?

I'll have a go at the scripting in 1.x.


(Nathan Trevivian) #6

OK, so I've written the native script, and installed it:
sudo bin/plugin -u file:///usr/local/bin/elasticsearch-1.7.3/plugins/cameraforensics/cf-elasticsearch-plugins-hammingdistance-0.1.0.jar -i hamming_distance

It didn't show under the list of loaded plugins, so I restarted and there it was:
$ curl 'http://localhost:9200/_cat/plugins?v'
name                     component        version type url
George Washington Bridge hamming_distance NA      j

And it's there in the logs:
[2016-02-08 16:38:15,745][INFO ][plugins ] [George Washington Bridge] loaded [mapper-attachments, marvel, knapsack-1.7.2.0-954d066, **hamming_distance**, cloud-aws], sites [marvel, bigdesk]

BUT when I try to run it:

curl -XPOST 'http://localhost:9200/scf/_search?pretty' -d '{
  "query": {
    "function_score": {
      "query":{
        "match_all": {}
      },
      "functions": [
        {
          "script_score": {
            "script": "hamming_distance",
            "lang" : "native",
            "params": {
              "param_hash": "b212f1007190e430d634d6324a3a4b3a4d6a6758403c455841cc11cd30c7325b",
              "param_field": "dhash"
            }
          }
        }
      ]
    }
  }
}'

I get an error:
ElasticsearchIllegalArgumentException[Native script [hamming_distance] not found]

Any ideas?

This is what my es-plugin.properties looks like:
plugin=com.cameraforensics.elasticsearch.plugins.HammingDistancePlugin
name=hamming_distance
version=0.1.0


(Nathan Trevivian) #7

PS: I also have a groovy script in /config/scripts called hamming_distance.groovy. Could they be conflicting?


(Nathan Trevivian) #8

I've resolved this now.

I actually needed to add an entry into the elasticsearch.yml file referencing my NativeScriptFactory implementation, like so:

script.native:
  mynativescript.type: org.spacevatican.elasticsearchexample.CustomScriptFactory

as mentioned here:
http://www.spacevatican.org/2012/5/12/elasticsearch-native-scripts-for-dummies/

On the initial data set, it appears to be faster by orders of magnitude, which is nice.
Now I just need to test against a few more tens-of-millions.
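For anyone landing here later: the snippet below is not the plugin code from this thread, just a standalone sketch of one way the per-document work could be kept small. It assumes Java 8 or later and hashes of equal length. The query hash is parsed into 64-bit words once and then compared word by word with Long.bitCount. The class name PackedHash and its methods are hypothetical.

```java
// Hypothetical helper, not an Elasticsearch class: a hex hash packed into 64-bit words.
final class PackedHash {
    private final long[] words;
    private final int hexLength;

    private PackedHash(long[] words, int hexLength) {
        this.words = words;
        this.hexLength = hexLength;
    }

    // Parses a hex string into words of 16 hex digits each (the last word may be shorter).
    static PackedHash fromHex(String hex) {
        int n = (hex.length() + 15) / 16;
        long[] words = new long[n];
        for (int i = 0; i < n; i++) {
            int start = i * 16;
            int end = Math.min(start + 16, hex.length());
            words[i] = Long.parseUnsignedLong(hex.substring(start, end), 16);
        }
        return new PackedHash(words, hex.length());
    }

    // Hamming distance; only defined here for hashes with the same number of hex digits.
    int distance(PackedHash other) {
        if (other.hexLength != hexLength) {
            throw new IllegalArgumentException("hash lengths differ");
        }
        int d = 0;
        for (int i = 0; i < words.length; i++) {
            d += Long.bitCount(words[i] ^ other.words[i]);
        }
        return d;
    }

    public static void main(String[] args) {
        // Query hash from the example request; parsed once, reused for every document.
        PackedHash query = PackedHash.fromHex(
            "b212f1007190e430d634d6324a3a4b3a4d6a6758403c455841cc11cd30c7325b");
        // Made-up document hashes: one identical, one with the last hex digit changed.
        String[] docs = {
            "b212f1007190e430d634d6324a3a4b3a4d6a6758403c455841cc11cd30c7325b",
            "b212f1007190e430d634d6324a3a4b3a4d6a6758403c455841cc11cd30c7325a"
        };
        for (String doc : docs) {
            System.out.println(400 - query.distance(PackedHash.fromHex(doc)));
        }
    }
}
```

In this sketch the document side is still parsed from hex on every comparison. Whether that matters in practice would need measuring on a real index.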

Thanks again for your help!


(Ivan Brusic) #9

Sorry if I did not catch your response, but I'm glad you found it. Doc
values are not enabled by default in 1.x. They are worth giving a shot. They
might be slower to load (and index), but can be faster depending on your
types of lookup.

Ivan

