Elasticsearch 5.5 two-phase search


(Dominik Safaric) #1

We are currently implementing an API for searching related images. The search operates on feature vectors. To determine the similarity of two images, we calculate the Hamming distance between feature vectors of 2000 bits (250 bytes). These feature vectors are stored as binary fields in Elasticsearch. We are implementing a custom native plugin that sorts documents by their Hamming distance with respect to the input feature vector. The function computing the Hamming distance runs in O(N), so no further algorithmic optimization is possible there. An example query is as follows:

query = {
    "query": {
        "match_all": {}
    },
    "sort": {
        "_script": {
            "type": "number",
            "script": {
                "lang": "sds_scripts",
                "source": "hamming_distance",
                "params": {
                    "lookup": random_descriptor_blob(2000),
                    "field": "descriptor"
                }
            }
        }
    }
}

We've tested the performance of the sort using 1M documents with 2000-bit feature vectors. The tests were run on a single-node cluster with one shard and zero replicas. The mean execution time over 100 query executions is 350 ms. Unfortunately, we are not satisfied with this performance and instead seek to execute the query below our 100 ms threshold. Hence, my questions are:

  1. How can the entire sort process be further optimized using the current implementation?
  2. Another possible solution would be implementing a TwoPhaseIterator, which would first approximate the matches using 64 bit feature vectors, and then verify these using the 2000 bit feature vectors. Does Elasticsearch support this kind of extension? If so, are there any documentation or examples?
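For reference, the approximate-then-verify idea behind question 2 can be illustrated in plain Java, independent of Lucene's TwoPhaseIterator API (which pairs a cheap approximation iterator with an exact matches() check, but requires wiring up a custom Query/Scorer). The class and method names below are hypothetical, chosen only to show the two phases:

```java
import java.util.ArrayList;
import java.util.List;

public class TwoPhaseSketch {

    // Phase 1: cheap approximation on a 64-bit signature.
    static int signatureDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // Phase 2: exact Hamming distance over the full 250-byte vector.
    static long exactDistance(byte[] a, byte[] b) {
        long d = 0;
        for (int i = 0; i < a.length; i++) {
            d += Integer.bitCount(0xff & (a[i] ^ b[i]));
        }
        return d;
    }

    // Keep only candidates whose cheap signature distance passes the
    // threshold, then verify the survivors with the expensive exact distance.
    static List<Integer> search(long querySig, byte[] queryVec,
                                long[] sigs, byte[][] vecs,
                                int sigThreshold, long exactThreshold) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < sigs.length; i++) {
            if (signatureDistance(querySig, sigs[i]) > sigThreshold) {
                continue; // approximation phase rejects most candidates cheaply
            }
            if (exactDistance(queryVec, vecs[i]) <= exactThreshold) {
                hits.add(i); // verification phase confirms the match
            }
        }
        return hits;
    }
}
```

The payoff depends on how selective the 64-bit signature is: the exact 250-byte comparison only runs for candidates that survive phase 1.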

(Ryan Ernst) #2

Can you share the source of your plugin? There are some traps in the scripting API which you may be hitting. Note that the scripting API is reworked in 5.6/6.0, which can help make these scripts a bit more efficient.


(Dominik Safaric) #3

@rjernst Unfortunately I can't share the entire source code of the plugin, so I'm attaching only the LeafSearchScript. The plugin is, however, fully compliant with the newest API changes. That is, the ScriptPlugin instantiates a ScriptEngineService, the ScriptEngineService compiles the target script and reuses it across searches, and the SearchScript implementing getLeafSearchScript instantiates a LeafSearchScript.

final class LeafSearchScript implements org.elasticsearch.script.LeafSearchScript {

    private final LeafSearchLookup leafSearchLookup;
    // field and bytes were undefined in the original excerpt; shown here as
    // constructor arguments, presumably populated from the script params.
    private final String field;  // name of the binary field holding the descriptor
    private final byte[] bytes;  // lookup feature vector from the script params

    public LeafSearchScript(LeafSearchLookup lookup, String field, byte[] bytes) {
        this.leafSearchLookup = lookup;
        this.field = field;
        this.bytes = bytes;
    }

    @Override
    public void setDocument(int doc) {
        if (this.leafSearchLookup == null) {
            throw new NullPointerException("leafSearchLookup must not be null");
        }
        this.leafSearchLookup.setDocument(doc);
    }

    @Override
    public double runAsDouble() {
        return this.runAsLong();
    }

    @Override
    public long runAsLong() {
        LeafDocLookup doc = leafSearchLookup.doc();
        if (doc == null) {
            throw new NullPointerException("doc lookup must not be null");
        }
        BytesRef bytesRef = ((ScriptDocValues.BytesRefs) doc.get(field)).getValue();
        if (bytes.length != bytesRef.length) {
            return -1;
        }
        // Hamming distance: count differing bits, one byte at a time.
        // Note: a BytesRef may be a slice of a larger array, so reads
        // must account for bytesRef.offset.
        long distance = 0;
        for (int i = 0; i < bytes.length; i++) {
            distance += Integer.bitCount(0xff & (bytes[i] ^ bytesRef.bytes[bytesRef.offset + i]));
        }
        return distance;
    }
}

(Ryan Ernst) #4

Have you considered casting the bytes into longs, 4 or 8 bytes at a time (e.g. using ByteBuffer)? Then you can do the XOR on longs/ints and perform fewer bit operations. That would be most efficient if you padded your BytesRef/bytes param to a multiple of 4 or 8.
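Ryan's suggestion can be sketched as follows. 250 bytes is not a multiple of 8, but zero-padding both arrays to 256 bytes makes the long-at-a-time XOR work without affecting the distance, since padded bits are equal on both sides. The LongHamming class and padding scheme here are illustrative, not taken from the plugin:

```java
import java.nio.ByteBuffer;

public class LongHamming {

    // Hamming distance computed 64 bits at a time: copy both arrays into
    // zero-padded buffers whose size is a multiple of 8, then XOR longs.
    static long hamming(byte[] a, byte[] b) {
        int padded = ((Math.max(a.length, b.length) + 7) / 8) * 8;
        // allocate() zero-fills, so the tail past a.length/b.length is 0.
        ByteBuffer ba = ByteBuffer.allocate(padded).put(a);
        ByteBuffer bb = ByteBuffer.allocate(padded).put(b);
        long distance = 0;
        for (int i = 0; i < padded; i += 8) {
            // Absolute getLong(index) ignores the buffer position; byte order
            // is irrelevant here because both buffers use the same order.
            distance += Long.bitCount(ba.getLong(i) ^ bb.getLong(i));
        }
        return distance;
    }

    public static void main(String[] args) {
        byte[] x = new byte[250];
        byte[] y = new byte[250];
        y[0] = 0x01;          // 1 differing bit
        y[249] = (byte) 0xFF; // 8 differing bits
        System.out.println(hamming(x, y)); // prints 9
    }
}
```

This replaces 250 byte-wise XOR/bitCount operations with 32 long-wise ones; in a hot loop the per-call ByteBuffer allocation could be avoided by converting the stored vectors to long[] once, outside the scoring loop.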


(Dominik Safaric) #5

Because wrapping a byte array into a ByteBuffer would slow down the computation, and secondly because I am dealing with arrays of 250 bytes, which is not a multiple of 4 or 8, so the reads would go out of range.


(system) #6

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.