We are currently implementing an API for searching for related images. The search operates on feature vectors: to determine the similarity of two images, we compute the Hamming distance between their feature vectors, each 2000 bits (i.e. 250 bytes) long. The feature vectors are stored as binary fields in Elasticsearch. We are implementing a custom native plugin that sorts documents by their Hamming distance to the input feature vector. The function computing the Hamming distance is O(N) in the vector length, so no further optimization is possible in that regard. An example query is as follows:
query = {
    "query": {
        "match_all": {}
    },
    "sort": {
        "_script": {
            "type": "number",
            "script": {
                "lang": "sds_scripts",
                "source": "hamming_distance",
                "params": {
                    "lookup": random_descriptor_blob(2000),
                    "field": "descriptor"
                }
            }
        }
    }
}
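For reference, the per-document cost is dominated by a popcount over the XOR of the two vectors. A minimal Python sketch of that computation (the function name and byte layout are illustrative, not the plugin's actual code):

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Bitwise Hamming distance between two equal-length byte strings."""
    # XOR the vectors as big integers, then count the set bits (popcount).
    return bin(int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).count("1")

# A 2000-bit descriptor is 250 bytes; fully complementary vectors differ in every bit.
print(hamming_distance(bytes(250), bytes([0xFF] * 250)))  # -> 2000
```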
We've tested the performance of the sort using a total of 1M documents with 2000-bit feature vectors. The tests were run on a single-node cluster with one shard and no replicas. The mean execution time over 100 query executions is 350 ms. Unfortunately, we are not satisfied with this performance; we instead aim to execute the query below a threshold of 100 ms. Hence, my questions are:
- How can the sort process be further optimized within the current implementation?
- Another possible solution would be implementing a TwoPhaseIterator, which would first approximate the matches using 64-bit feature vectors and then verify them using the full 2000-bit feature vectors. Does Elasticsearch support this kind of extension, and if so, is there any documentation or are there examples?
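For context on the second question: Lucene does expose a `TwoPhaseIterator` abstraction (a cheap `approximation()` iterator plus an exact `matches()` check) for exactly this pattern, and native plugins can build custom queries on top of it. The intended logic can be sketched outside Lucene as follows; all names, the cutoff, and the document layout here are hypothetical, not the Lucene API itself:

```python
def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")

def two_phase_sort(query64: int, query2000: int, docs: dict, approx_cutoff: int) -> list:
    """Two-phase ranking sketch: cheap 64-bit approximation, then exact verification.

    docs maps doc_id -> (desc64, desc2000); descriptors are modeled as ints.
    """
    results = []
    for doc_id, (desc64, desc2000) in docs.items():
        # Phase 1: cheap popcount on the 64-bit sketch rejects most documents.
        if popcount(query64 ^ desc64) > approx_cutoff:
            continue
        # Phase 2: exact Hamming distance on the full 2000-bit vector.
        results.append((doc_id, popcount(query2000 ^ desc2000)))
    return sorted(results, key=lambda t: t[1])
```

The key design question is choosing `approx_cutoff` so that phase 1 rarely discards a true near-neighbor while still pruning the bulk of the collection.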