We're using deep convolutional neural networks to process images quickly and to make building image similarity search engines a commodity for everyone. Our pipeline uses the neural nets to turn images into actionable binary (or float) hash codes, which are then used for image similarity retrieval.
This is where the struggle begins. First, there does not seem to be a built-in similarity measure such as Hamming distance for binary hashes in ES; am I correct? Similarly, there does not seem to be a cosine similarity or Euclidean distance metric built into ES that applies to arrays of floats; is this still correct?
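For reference, the metric we'd like ES to compute natively is trivial to express outside of it. A minimal sketch, assuming the binary hash codes are stored as integers:

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two binary hash codes stored as ints:
    XOR leaves a 1 bit at every position where the codes differ,
    then we count the set bits (popcount)."""
    return bin(a ^ b).count("1")

# Two 8-bit codes differing in exactly two bit positions.
print(hamming(0b10110100, 0b10010110))  # → 2
```

The problem is not computing this once, of course, but having ES rank millions of documents by it.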
It is noteworthy that using the fuzziness capability on strings of 1s and 0s almost got us where we wanted, but the Levenshtein distance appears to be capped at two edits, which is too low for our strings, which are commonly 1,000 characters long. As a side note, I am interested in understanding why there is a cap on the Levenshtein distance.
Is there an ES expert out there who could confirm our findings above, and possibly point us to where to look and what the best options are for scalable similarity search based on binary hashes or arrays of floats?
Hi @beniz, my colleague pointed me towards this thread. Very interesting stuff!
Correct, Elasticsearch doesn't currently offer either of these as direct similarity metrics over a binary field. When the murmur3 analyzer was added, I remember us talking about adding hamming / minhash style capabilities, but it never materialized.
It's a good question, and not entirely intuitive if you don't know the backstory. The simple explanation is "for technical reasons".
The fuzzy query (and variants like wildcard, prefix, etc.) works by generating a Finite State Automaton (FSA) that represents every variation of your input text within n edits. This is then intersected with the term dictionary (which is stored in an FST), and the resulting matches can be enumerated much more quickly than with the brute-force method of fuzzy evaluation.
Ultimately, the reason the edit distance is limited to 2 is because the generated FSA is already quite huge. Edits higher than 2 would start generating truly enormous FSAs, and the number of matching terms would greatly slow down the evaluation. So Lucene made the decision to just cap the limit at 2 edits.
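For contrast, here is the brute-force evaluation that the precompiled automaton avoids: a standard Wagner-Fischer dynamic-programming sketch (not Lucene's actual code), which would have to be re-run against every candidate term in the dictionary:

```python
def levenshtein(s: str, t: str) -> int:
    """Classic Wagner-Fischer dynamic programming: O(len(s) * len(t))
    per candidate term. Lucene instead compiles the query string into
    an FSA once and intersects it with the term dictionary."""
    prev = list(range(len(t) + 1))
    for i, sc in enumerate(s, 1):
        curr = [i]
        for j, tc in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (sc != tc)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```

The FSA size blows up with the edit-distance parameter, which is why the cap sits at 2.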
Part 2, since the forums said I wrote too much for one post
So, I encountered this exact problem a few years ago when working on a biology demo. I wanted to do genomic sequence alignment, which is a very similar problem: given two strings of characters (constrained to 20 unique characters, one per amino acid), how similar are they?
As you discovered, Hamming distances are difficult in ES. There are a few options:

- Write a Similarity module and plug it into Lucene, but this is very technically challenging and invasive.
- Write a Scripted Metric Aggregation (which didn't actually exist at the time) to perform the similarity in a MapReduce fashion, but this will likely be slow since it needs to evaluate the script against each doc.
- Some kind of nGram scheme. I found this to be too restrictive (high precision + low recall, or low precision + high recall), with little ability to tune the accuracy tradeoff. It also required a lot of query gymnastics, since you had to generate the nGram fragments yourself and use phrases/spans to check for ordering.
- Some other wonky schemes, like a Chaos Game Representation queried via geo-based lookup. Suffice it to say, it was interesting but didn't work.
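To illustrate the nGram option's query gymnastics, here is a hypothetical sketch: you shred the binary hash string into overlapping fragments yourself, and similarity becomes plain fragment overlap. Note how a single flipped bit destroys every fragment covering it:

```python
def ngrams(bits: str, n: int = 8):
    """Shred a binary hash string into overlapping character nGrams.
    Similarity is then approximated by counting shared fragments,
    which is hard to tune for precision vs. recall."""
    return [bits[i:i + n] for i in range(len(bits) - n + 1)]

a = ngrams("1011010010")  # original hash
b = ngrams("1011010011")  # same hash with the last bit flipped
print(len(set(a) & set(b)), "shared fragments out of", len(a))  # → 2 of 3
```

One bit flip already knocked out a third of the fragments here; on a 1,000-bit hash the effect compounds quickly.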
Ultimately, I realized that since ES lacks the appropriate similarity metrics to do binary hamming-distance-style evaluation, I had to bake some of that logic into the tokens themselves. I used a custom genomic Locality Sensitive Hash (similar to a pHash for images). Then the procedure was:
1. Generate a global table of random genomic trigrams. For your application, it could easily be a random list of image or pHash features, or even just random bits. They are essentially random projections into the multidimensional problem space.
2. At index time, compare the target sequence against each random projection. If it matches within some threshold, emit the random projection's ordinal number as a single token. In biology we use a substitution matrix called BLOSUM, since some amino acids are better substitutions than others, to generate a score for each position. But you could simply use a Hamming distance, or bitwise AND the hashes and count bits, etc. The actual scoring is relatively unimportant and domain-specific.
3. Repeat this process for all random projections. When you're done, you have a list of projections that your target approximately matches. Index these as an array of tokens.
4. At search time, repeat the process for your candidate sequence. Search for the resulting list of tokens. Since each token represents a random projection into the sequence space (with some fuzziness depending on the scoring threshold in step 2), each matching token means your sequence is approximately and semantically similar to the target in some manner (because we used a locality sensitive hash). The more overlapping tokens, the more similar they are.
5. (Optional) I performed a rescore phase afterwards, which did an exact sequence alignment, just to re-order the top n results more precisely and weed out the obviously false positives. You could do something similar with a real Hamming distance, etc.
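The whole procedure can be sketched end to end in a few lines. This is a hypothetical toy version, using random bit-strings as the projections and plain Hamming distance as the per-projection score (the names, sizes, and threshold are all assumptions to tune):

```python
import random

random.seed(42)
BITS = 64        # hash length
N_PROJ = 100     # size of the global projection table
THRESHOLD = 28   # max Hamming distance to count as "matching" a projection

# Step 1: global table of random projections into the hash space.
projections = [random.getrandbits(BITS) for _ in range(N_PROJ)]

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def tokenize(code: int) -> set:
    """Steps 2-3: emit the ordinal of every projection this code
    approximately matches. These ordinals are the tokens you index."""
    return {i for i, p in enumerate(projections)
            if hamming(code, p) <= THRESHOLD}

# Step 4: at search time, similarity ~ number of overlapping tokens.
doc = random.getrandbits(BITS)
near = doc ^ 0b111                 # flip 3 bits -> semantically close
far = random.getrandbits(BITS)     # unrelated hash

print("near overlap:", len(tokenize(doc) & tokenize(near)))  # usually large
print("far overlap:", len(tokenize(doc) & tokenize(far)))    # usually small
```

In ES itself, `tokenize` would run in your indexing pipeline and at query time, and the overlap counting falls out of ordinary term matching and scoring.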
It obviously has some issues. You need "enough" random projections to cover the search space, and you can never guarantee that every candidate is fully covered. It may return spurious results which happen to hash to similar regions but are in fact very different. And the method of hashing affects how results are grouped. But it's a decent approximation, and most importantly, is very fast (since at the ES layer it's just simple token matching).
I'm unsure how your image hashing works, but if it can emit actual features instead of collapsing them to a binary hash, you could side-step the entire random-projection issue and just index those features directly.
E.g. perhaps you can grab the weights of the final output layer, discretize them into buckets (to make the tokens easier to search than raw floats), and index those weights directly. That only works if the outputs share semantic overlap between similar inputs; I don't know enough about deep convolutional networks to say.
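A hypothetical sketch of that discretization: quantize each output activation into a bucket and emit one `dim<i>_b<bucket>` token per position, so similar activation vectors share many tokens and plain term matching approximates similarity (the bucket count and value range are assumptions to tune):

```python
def discretize(weights, n_buckets=8, lo=-1.0, hi=1.0):
    """Quantize each float activation into one of n_buckets bins and
    emit a 'dim<i>_b<bucket>' token per position. Values outside
    [lo, hi) are clamped into the edge buckets."""
    width = (hi - lo) / n_buckets
    tokens = []
    for i, w in enumerate(weights):
        b = min(n_buckets - 1, max(0, int((w - lo) / width)))
        tokens.append(f"dim{i}_b{b}")
    return tokens

print(discretize([-0.9, 0.05, 0.8], n_buckets=4))
# → ['dim0_b0', 'dim1_b2', 'dim2_b3']
```

Coarser buckets give more recall (more shared tokens between similar images) at the cost of precision, so the bucket count plays the same role as the scoring threshold in the LSH scheme.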
Hope that helps! I'd be happy to talk through any potential ideas you have, even if they are entirely unrelated to this novel I just wrote. I really love thinking about these kinds of use cases in ES. Good luck!
If you're curious, this is a slide-deck from a meetup about the above system. It's a little hard to grok since the slides are more speaking aids than self-descriptive, but hopefully it helps a bit:
Hi @polyfractal, many thanks for taking the time to write such detailed answers. I got a good laugh from the quote above, thanks for that as well. It will take a bit of time to re-read this in detail, do some tests, and share thoughts.