I've tried a few times to write a scoring function that scales into the millions or billions of documents. At the moment, mine don't even scale within the millions.
The latest one was a simple scoring script that looked something like this:
if (doc[param_field].value == null) {
    return 0;
} else {
    BigInteger left = new BigInteger(doc[param_field].value, 16);
    BigInteger right = new BigInteger(param_hash, 16);
    return 400 - left.xor(right).bitCount();
}
This might seem a little naive as a script (I'm kinda new to this), but as you can see, it's essentially calculating the Hamming distance between a given parameter and a document field, and then scoring accordingly.
The "400" gives you some idea of how long these fields are expected to be (not very long, remembering that this is a bit count).
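For reference, the script body above boils down to this standalone Java (the hex hashes in the example are made up; the 400 ceiling is taken from the post):

```java
import java.math.BigInteger;

public class HammingDemo {
    // Mirrors the scoring script: XOR the two hex-encoded hashes and count the
    // differing bits, then subtract from the 400-bit ceiling.
    static int score(String docHash, String paramHash) {
        BigInteger left = new BigInteger(docHash, 16);
        BigInteger right = new BigInteger(paramHash, 16);
        return 400 - left.xor(right).bitCount();
    }

    public static void main(String[] args) {
        // "ff" and "fe" differ in exactly one bit, so the score is 400 - 1.
        System.out.println(score("ff", "fe")); // prints 399
        // Identical hashes get the maximum score.
        System.out.println(score("abc", "abc")); // prints 400
    }
}
```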
At the moment this takes far too long to run (~4s) on a document set of 17k. In my experience with my previous attempt at a scoring script (that one calculated Euclidean distance), the execution time grows as the document base grows.
My question is really around the scalability of scoring function scripts.
How can I write one that scales?
Why is mine so slow?
Should I expect it to be slow?
If anyone could point me towards any resources advising on scoring script optimisation etc that would be great!
Thanks, Ivan. That is true for v2.x, but unfortunately not for v1.7.3. At least, it's not in the docs for my version.
I could upgrade, but I want to be sure it'll be worth it.
To answer your question, param_hash is a hex encoded binary string, much like my param_field value. The strings are actually phashes or dhashes of images (the docs in the es index represent images), and in comparing hamming distance I'm comparing image similarity.
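Since p-hashes and d-hashes are fixed-length, one way to avoid allocating two BigIntegers per document (an assumption about where the time goes, not something profiled here) is to pre-parse the hashes into 64-bit words once and compare chunks with Long.bitCount. A sketch, assuming the hex strings are a multiple of 16 characters:

```java
public class FastHamming {
    // Parse a hex string into 64-bit words.
    // Assumes the string length is a multiple of 16 hex characters.
    static long[] toWords(String hex) {
        long[] words = new long[hex.length() / 16];
        for (int i = 0; i < words.length; i++) {
            words[i] = Long.parseUnsignedLong(hex.substring(i * 16, (i + 1) * 16), 16);
        }
        return words;
    }

    // XOR word-by-word and count the differing bits.
    static int hamming(long[] a, long[] b) {
        int distance = 0;
        for (int i = 0; i < a.length; i++) {
            distance += Long.bitCount(a[i] ^ b[i]);
        }
        return distance;
    }
}
```

In a script context you would parse the query-side hash once (it is the same for every document) rather than on every invocation.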
Where is param_hash coming from? I cannot tell from your script. You should
use stored fields and not source (which is what it appears that you are
doing). Are doc_values enabled on those fields?
Re: scripting in 1.x - didn't see that there! The docs for 2.2 are definitely clearer in this area. Do the docs still apply to 1.x? It's easier to follow instructions when doing something new for the first time...
To answer your question re:param_hash: it's a parameter that I pass in with the scoring script info, for example:
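Something along these lines (script name, field name, and hash value below are placeholders, not my real values):

```json
{
  "query": {
    "function_score": {
      "script_score": {
        "script": "hamming_distance",
        "params": {
          "param_field": "phash",
          "param_hash": "a1b2c3d4e5f60789"
        }
      }
    }
  }
}
```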
I can't store this parameter in the document, because it IS a query-time parameter.
Re: doc_values being enabled, I don't believe they are. I thought I had set these fields as "not_analyzed", which would make doc_values default to true, but having just checked my mappings I can see that they're not set like that after all. So no, doc_values are not enabled. But doc_values are slower than in-memory fielddata, right?
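For the record, enabling doc values on a 1.x string field looks something like this (type and field names are placeholders):

```json
{
  "mappings": {
    "image": {
      "properties": {
        "phash": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true
        }
      }
    }
  }
}
```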
OK, so I've written the native script, and installed it:

sudo bin/plugin -u file:///usr/local/bin/elasticsearch-1.7.3/plugins/cameraforensics/cf-elasticsearch-plugins-hammingdistance-0.1.0.jar -i hamming_distance
It didn't show under the list of loaded plugins, so I restarted, and there it was:

$ curl 'http://localhost:9200/_cat/plugins?v'
name                     component        version type url
George Washington Bridge hamming_distance NA      j
And it's there in the logs:

[2016-02-08 16:38:15,745][INFO ][plugins ] [George Washington Bridge] loaded [mapper-attachments, marvel, knapsack-1.7.2.0-954d066, **hamming_distance**, cloud-aws], sites [marvel, bigdesk]
But when I try to use it, I get an error:

ElasticsearchIllegalArgumentException[Native script [hamming_distance] not found]
Any ideas?
This is what my es-plugin.properties looks like:
plugin=com.cameraforensics.elasticsearch.plugins.HammingDistancePlugin
name=hamming_distance
version=0.1.0
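In 1.x, a native script also has to be registered from the plugin class itself, not just declared in es-plugin.properties; if that hook is missing, the jar loads but Elasticsearch never learns the script name, which matches the "not found" error above. A sketch of what that registration looks like (it won't compile standalone without the Elasticsearch 1.x jars, and the factory class name is assumed to mirror the properties file):

```java
import org.elasticsearch.plugins.AbstractPlugin;
import org.elasticsearch.script.ScriptModule;

public class HammingDistancePlugin extends AbstractPlugin {
    @Override
    public String name() {
        return "hamming_distance";
    }

    @Override
    public String description() {
        return "Hamming distance native scoring script";
    }

    // Without this hook, the plugin loads but the script name is never
    // registered, producing "Native script [hamming_distance] not found".
    public void onModule(ScriptModule module) {
        module.registerScript("hamming_distance", HammingDistanceScriptFactory.class);
    }
}
```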
Sorry I did not catch your response earlier, but I'm glad you found it. Doc
values are not enabled by default in 1.x. They are worth giving a shot: they
might be slower to load (and index), but can be faster depending on your
types of lookup.