Vector Scoring

Hi guys,
I want to run K-Nearest-Neighbors on feature vectors stored in ES.
I wrote a plugin based on this one: https://github.com/MLnick/elasticsearch-vector-scoring
I'm trying to improve query performance. Right now a query takes ~1 second, and I'd like to make it 10 times faster.

Details:
I have 2M documents in the index.
Each document contains a 64-dimensional float vector in a field named "embedding".
This is its mapping:

"analysis": {
  "analyzer": {
    "payload_analyzer": {
      "filter": "delimited_payload_filter",
      "tokenizer": "whitespace",
      "type": "custom"
    }
  }
}

"mappings": {
  "properties": {
    "embedding": {
      "analyzer": "payload_analyzer",
      "term_vector": "with_positions_offsets_payloads",
      "type": "string"
    }
  }
}

This is the search query:
{
  "size": 100,
  "query": {
    "function_score": {
      "boost_mode": "replace",
      "script_score": {
        "script": "payload_vector_score",
        "lang": "native",
        "params": {
          "field": "embedding",
          "cosine": false,
          "vector": [-0.06555712223052979 ,0.0639316588640213 ,-0.1625019609928131 ,-0.049717679619789124 ,-0.08388650417327881 ,-0.05376458540558815 ,-0.21441558003425598 ,0.14069288969039917 ,0.028580941259860992 ,0.07442957907915115 ,-0.19108714163303375 ,-0.10003119707107544 ,0.034126054495573044 ,-0.11807726323604584 ,0.04761182889342308 ,0.004601459950208664 ,-0.12167082726955414 ,0.2301076203584671 ,-0.005734231788665056 ,0.016479089856147766 ,0.025114329531788826 ,-0.015090115368366241 ,0.005890047177672386 ,-0.04142259433865547 ,0.15503185987472534 ,0.09912215173244476 ,0.1551043689250946 ,0.14985895156860352 ,0.2064201831817627 ,-0.1238853856921196 ,0.04467460513114929 ,-0.061931200325489044 ,-0.04865756630897522 ,-0.009241082705557346 ,-0.19579431414604187 ,0.21952545642852783 ,0.1435101181268692 ,-0.2241126447916031 ,0.08423150330781937 ,-0.11718004941940308 ,0.01940910331904888 ,-0.09160779416561127 ,0.1686438024044037 ,0.1839606910943985 ,0.1823773831129074 ,0.07107185572385788 ,0.1360888034105301 ,0.21161314845085144 ,-0.009615485556423664 ,0.08052477240562439 ,-0.1621086150407791 ,-0.037252187728881836 ,-0.0528680719435215 ,-0.07718119770288467 ,-0.05522914603352547 ,-0.24222344160079956 ,0.052051275968551636 ,-0.10451067239046097 ,0.09648159146308899 ,0.11125080287456512 ,-0.2878655791282654 ,-0.10746297240257263 ,0.04359650984406471 ,0.11088574677705765]
        }
      }
    }
  }
}
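
For context, here is a rough sketch of the per-document computation the script is assumed to perform: a plain dot product between the query vector and the stored vector when "cosine" is false, and a cosine similarity when it is true. The class and method names are made up for illustration and are not taken from the plugin's source.

```java
// Illustrative sketch only: names are hypothetical, not the plugin's code.
final class VectorScoreSketch {

    // Dot product of two equal-length vectors, accumulated in double precision.
    static double dot(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (double) a[i] * b[i];
        }
        return sum;
    }

    // Cosine similarity; returns 0 if either vector has zero length.
    static double cosine(float[] a, float[] b) {
        double denom = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
        return denom == 0.0 ? 0.0 : dot(a, b) / denom;
    }

    // Assumed meaning of the "cosine" query parameter.
    static double score(float[] query, float[] doc, boolean useCosine) {
        return useCosine ? cosine(query, doc) : dot(query, doc);
    }
}
```

With 2M documents and 64 dimensions this is on the order of 128M multiply-adds per query, so the cost of fetching each stored vector tends to matter at least as much as the arithmetic itself.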

I tried several other ways to achieve this, but all were slower or failed:
Try 1 - failed
Store the vector as an array of floats.
It failed because the ES index does not keep the order of the elements in the vector, and using _source was (of course) very slow.

Try 2 - slower than the above
Store the vector as a comma-separated string.
I wrote a scoring plugin which converts the string to a vector at runtime.

Am I approaching this wrong? Is there any optimization that can be done?

Thanks in advance,
Lior

I would try storing the data as a binary field:
https://www.elastic.co/guide/en/elasticsearch/reference/current/binary.html

However, I don't think you are going to get a 10x speedup. You are doing an expensive computation inside an already tight loop. One thing to consider is not using IndexLookup. It is going away. Instead, use the Lucene APIs directly. This will be documented better soon, but will still require learning about Lucene. Specifically, you should look at LeafReader.getSortedBinaryDocValues. Note that the "sorted" won't matter here for you, because you should encode the entire 64 element array into a single value.
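
As a rough illustration of encoding the whole array into a single value, a client could pack the floats into bytes and Base64-encode them, since the binary field type takes a Base64 string. This is only a sketch: the class and method names are invented, and the big-endian 4-bytes-per-float layout is an assumption that the scoring side would have to mirror when decoding.

```java
import java.nio.ByteBuffer;
import java.util.Base64;

// Illustrative sketch only: names and byte layout are assumptions.
final class VectorEncodeSketch {

    // Packs each float as 4 big-endian bytes (ByteBuffer's default order).
    static byte[] toBytes(float[] vector) {
        ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES);
        for (float x : vector) {
            buf.putFloat(x);
        }
        return buf.array();
    }

    // Base64 string suitable for sending as the value of a binary field.
    static String toBase64(float[] vector) {
        return Base64.getEncoder().encodeToString(toBytes(vector));
    }
}
```

A 64-dimensional vector encoded this way is 256 bytes per document before Base64.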

Thanks Ryan!

I'll try using the binary field. It might be better since it occupies less memory in the Lucene index, and accessing the vector members should be much faster than with IndexLookup.

I read your PR in order to understand how to access BinaryDocValues from an ES plugin.
You seem to do it by defining a new scripting language and by using something called ScriptEngine.

  1. I could sure use an example of how it all ties together.
  2. Can I do it on ES 2.4.4?

Thanks in advance,
Lior

The example from my PR is here (which is now live). It was a simple scoring example. It doesn't actually read any doc values, but it does show how to create a script which uses the Lucene API. You can see in the example that I get the LeafReader (which is what you need in order to get BinaryDocValues) from the LeafReaderContext.

you seem to do it by defining a new scripting language and by using something called ScriptEngine

A ScriptEngine is a script language implementation. But "language" is a loose term. In fact, all that the "native" scripts you use now do is provide a very thin script engine, which calls your NativeScriptFactory when ScriptEngine.compile is called, and calls newScript when either ScriptEngine.search or ScriptEngine.executable is called. So writing a script engine will be very similar to the NativeScriptFactory you already have.

can I do it on ES 2.4.4 ?

Yes, script engines have been around since scripting was added to ES. The interface name is slightly different before 6.0 (ScriptEngineService), but works essentially the same.

I could sure use an example of how it all ties together.

Some steps for you:

  1. Convert your NativeScriptFactory to a ScriptEngineService. The compile method should contain conditionals looking for the name of the script in the "source", just like is done in the example.
  2. Convert your client code to use a script language of whatever you specified in your engine.
  3. Modify the LeafSearchScript to call context.reader().getBinaryDocValues to get an accessor for doc values on your binary field (on construction of the leaf script, just like the example does when getting a postings accessor). I misspoke in my earlier comment when I said there was a getSortedBinaryDocValues.
  4. Call advanceExact on the doc values iterator for each call to setDocument on the leaf search script (make sure to note the return value, since there might not actually be a value for the given document).
  5. Within runAsDouble(), call binaryValue() on your iterator (but only if advanceExact returned true). This is now where it gets tricky. The binary field type in ES does its own encoding of the binary values (to allow for multiple distinct values, vs Lucene, which only has a single binary value concept), and there isn't any existing code to decode this (at least not any that is easily accessible from scripts). So, you need to decode your one value on your own. Get your BytesRef from the doc values iterator, and create a ByteArrayDataInput. Call readVInt() on that, and it should return 1 (the number of values you have), then call readVInt() again, and it should be the number of bytes remaining. Call getPosition() on your data input, and that is the index into your BytesRef.bytes where your data starts. Then you can decode the bytes for each float (you might try creating a ByteBuffer at that offset, then converting to a FloatBuffer, and using its get(index) method to access each of your values).
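
The decoding in step 5 could look roughly like the sketch below. To keep it runnable without Lucene on the classpath, it uses a hand-written readVInt that follows Lucene's variable-length int format (7 bits per byte, low-order bits first, high bit set when more bytes follow) instead of ByteArrayDataInput. The class and method names are made up, and the float layout (4 big-endian bytes each) is an assumption that must match however the vector was encoded at index time.

```java
import java.nio.ByteBuffer;
import java.nio.FloatBuffer;

// Illustrative sketch only: names and float byte order are assumptions.
final class BinaryVectorDecodeSketch {

    // Reads a variable-length int starting at pos[0] and advances pos[0].
    static int readVInt(byte[] bytes, int[] pos) {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = bytes[pos[0]++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    // Decodes one doc value laid out as: vInt count, vInt length, payload.
    // 'offset' plays the role of BytesRef.offset.
    static float[] decode(byte[] bytes, int offset) {
        int[] pos = {offset};
        int count = readVInt(bytes, pos);
        if (count != 1) {
            throw new IllegalStateException("expected exactly 1 value, got " + count);
        }
        int length = readVInt(bytes, pos);
        // View the payload as floats without copying byte by byte.
        FloatBuffer floats = ByteBuffer.wrap(bytes, pos[0], length).asFloatBuffer();
        float[] out = new float[floats.remaining()];
        floats.get(out);
        return out;
    }
}
```

In a real script it would probably be better to keep the FloatBuffer and call get(index) inside the scoring loop rather than allocate a new float[] per document.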

I hope that helps!


Thanks, Ryan! I'll try it out.

Well Ryan - You're a champ!
The binary approach is 7-8 times faster than my original approach.

I did it a little differently:
In my ScriptEngineService I injected the BinaryDocValues reader (for my binary field) into the plugin, see setBinaryEmbeddingReader:

public SearchScript search(CompiledScript compiledScript, final SearchLookup lookup, @Nullable final Map<String, Object> vars) {
    final NativeScriptFactory scriptFactory = (NativeScriptFactory) compiledScript.compiled();
    final VectorScoreScript script = (VectorScoreScript) scriptFactory.newScript(vars);
    return new SearchScript() {
        @Override
        public LeafSearchScript getLeafSearchScript(LeafReaderContext context) throws IOException {
            // Hand the script this segment's doc values reader for the binary field.
            script.setBinaryEmbeddingReader(context.reader().getBinaryDocValues(VectorScoreScript.field));
            return script;
        }

        @Override
        public boolean needsScores() {
            return scriptFactory.needsScores();
        }
    };
}

And inside the run method in my plugin I got the byte array as follows:
final byte[] bytes = binaryEmbeddingReader.get(docId).bytes;
where docId is set in setDocument (my plugin implements LeafSearchScript).

Thanks for the help!
