We are currently implementing an API for searching for related images. The search operates on feature vectors: to determine the similarity of two images, we compute the Hamming distance between their feature vectors, each 2000 bits (i.e. 250 bytes) long. The feature vectors are stored as binary fields in Elasticsearch. We are implementing a custom native plugin that sorts documents by their Hamming distance to the input feature vector. The function computing the Hamming distance is O(N) in the vector length, so no further optimization is possible in that regard. An example query is as follows:
query = {
    "query": {
        "match_all": {}
    },
    "sort": {
        "_script": {
            "type": "number",
            "script": {
                "lang": "sds_scripts",
                "source": "hamming_distance",
                "params": {
                    "lookup": random_descriptor_blob(2000),
                    "field": "descriptor"
                }
            }
        }
    }
}
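For reference, the per-document cost is dominated by a popcount over the XOR of the two vectors. A minimal Python sketch of that computation (the function name and byte layout are illustrative, not the plugin's actual code):

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Bitwise Hamming distance between two equal-length byte strings."""
    # XOR the vectors as big integers, then count the set bits (popcount).
    return bin(int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).count("1")

# A 2000-bit descriptor is 250 bytes; fully complementary vectors differ in every bit.
print(hamming_distance(bytes(250), bytes([0xFF] * 250)))  # -> 2000
```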
We've tested the performance of the sort using a total of 1M documents with 2000-bit feature vectors. The tests were run on a single-node cluster with one shard and no replicas. The mean execution time over 100 query executions is 350 ms. Unfortunately, we are not satisfied with this performance; we instead aim to execute the query below a threshold of 100 ms. Hence, my questions are:
- How can the sort process be further optimized within the current implementation?
- Another possible solution would be implementing a TwoPhaseIterator, which would first approximate the matches using 64-bit feature vectors and then verify them using the full 2000-bit feature vectors. Does Elasticsearch support this kind of extension, and if so, is there any documentation or are there examples?
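For context on the second question: Lucene does expose a `TwoPhaseIterator` abstraction (a cheap `approximation()` iterator plus an exact `matches()` check) for exactly this pattern, and native plugins can build custom queries on top of it. The intended logic can be sketched outside Lucene as follows; all names, the cutoff, and the document layout here are hypothetical, not the Lucene API itself:

```python
def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")

def two_phase_sort(query64: int, query2000: int, docs: dict, approx_cutoff: int) -> list:
    """Two-phase ranking sketch: cheap 64-bit approximation, then exact verification.

    docs maps doc_id -> (desc64, desc2000); descriptors are modeled as ints.
    """
    results = []
    for doc_id, (desc64, desc2000) in docs.items():
        # Phase 1: cheap popcount on the 64-bit sketch rejects most documents.
        if popcount(query64 ^ desc64) > approx_cutoff:
            continue
        # Phase 2: exact Hamming distance on the full 2000-bit vector.
        results.append((doc_id, popcount(query2000 ^ desc2000)))
    return sorted(results, key=lambda t: t[1])
```

The key design question is choosing `approx_cutoff` so that phase 1 rarely discards a true near-neighbor while still pruning the bulk of the collection.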