What is the best scheme for similarity search based on binary codes with Elasticsearch ? (Context is image search with deep nets)

beniz · February 27, 2016, 8:02am

Hi Elastic!

We're using deep convolutional neural networks to fast process images and make the building of image similarity search engines a commodity for everyone. Our pipeline uses the neural nets to turn images into actionable binary (or float) hash codes that are then used for image similarity retrieval.

Performances and accuracy are well above those of existing techniques such as the LIRE plugin (https://www.elastic.co/blog/found-getting-started-with-lire-and-elasticsearch). An easy to reproduce example lies here: https://github.com/beniz/deepdetect/tree/imgsearch/demo/imgsearch
Not that it does not use Elasticsearch.

However, the solution above does not scale well above a million images, and we have studied how to complement our existing work (https://www.elastic.co/blog/categorizing-images-with-deep-learning-into-elasticsearch) and use Elasticsearch in a seamless manner for image similarity search as well.

This is where the struggle begins. First there does not seem to be a built-in similary measure such as Hamming distance for binary hashes in ES, am I correct ? Similary there does not seem to be a cosine similarity or Euclidean comparison metric built into ES with application to arrays of float, is this still correct ?

We've studied and tried the methods described in http://stackoverflow.com/questions/32785803/similar-image-search-by-phash-distance-in-elasticsearch and that involve no additional code (or plugin to ES). This is because we favor 'out-of-the-box' and seamless pipelines that do not require fiddling with the inner guts of ES (or any other piece of software for what matters!).

I is noteworthy that using the fuzziness capability on strings of 1s and 0s almost got us where we wanted, but Levenshtein distance appears to not go above two edits, which is too low for our strings which are commonly a 1000 characters long. On the sidenote, I am interested in understanding why the cap on the Levenshtein distance ?

Any ES expert out there who could confirm our findings above, and possibly give us leads to where to look and what best options for scalable similarity search based on binary hashes or arrays of floats ?

Thanks in advance!

polyfractal · March 1, 2016, 3:08pm

Hi @beniz, my colleague pointed me towards this thread. Very interesting stuff!

Correct, Elasticsearch doesn't currently offer either of these as direct similarity metrics over a binary field. When the murmur3 analyzer was added, I remember us talking about adding hamming / minhash style capabilities, but it never materialized.

It's a good question, and not entirely intuitive if you don't know the backstory. The simple explanation is "for technical reasons".

The fuzzy query (and variants, like wildcard, prefix, etc) work by generating a Finite State Automata that represents every variation of your input text within n edits. This is then intersected with the term dictionary (which is stored in a FST), and the resulting matches can be enumerated much quicker than trying the brute-force method of fuzzy evaluation.

The trick is that generating a "fuzzy" FSA that accepts the appropriate edits is very challenging technically. Mike McCandless wrote up the story of how it was implemented in Lucene, it's a fun read: Changing Bits: Lucene's FuzzyQuery is 100 times faster in 4.0

Ultimately, the reason the edit distance is limited to 2 is because the generated FSA is already quite huge. Edits higher than 2 would start generating truly enormous FSAs, and the number of matching terms would greatly slow down the evaluation. So Lucene made the decision to just cap the limit at 2 edits.

If you're curious, Mike has written some more about how FSTs are used in Lucene in other places: Changing Bits: Using Finite State Transducers in Lucene

polyfractal · March 1, 2016, 3:09pm

Part 2, since the forums said I wrote too much for one post

So, I encountered this exact problem a few years ago when working on a biology demo. I wanted to do genomic sequence alignment, which is a very similar problem: given two strings of characters (constrained to 20 unique characters, one per amino acid), how similar are they?

As you discovered, hamming distances are difficult in ES. There are a few options

Write a Similarity Module and plug that into Lucene, but it is very technically challenging and invasive.
Write a Scripted Metric Aggregation (which actually didn't exist at the time) to perform the similarity in a MapReduce fashion, but this will likely be slow since it needs to evaluate the script against each doc.
Some kind of nGram scheme. I found this to be too restriction (high precision+low recall or low precision+high recall), with little ability to tune the accuracy tradeoff. It also required a lot of query gymnastics, since you had to generate the nGram fragments yourself and use phrases/spans to check for ordering
Some other wonky schemes like a Chaos Game Representation then querying by Geo-based lookup. Suffice to say it was interesting but didn't work

Ultimately, I realized that since ES lacks the appropriate similarity metrics to do binary hamming-distance-style evaluation, I had to bake some of that logic into the tokens themselves. I used a custom genomic Locality Sensitive Hash (similar to a pHash for images). Then the procedure was:

Generate a global table of random genomic trigrams. For your application, it could easily be a random list of image or pHash features, or even just random bits. They are essentially random projections into the multidimensional problem space.
At index-time, compare the target sequence against each random projection. If it matches within some threshold, emit the random projection's ordinal number as a single token. In biology we use a substitution matrix called BLOSUM, since some amino acids are better substitutions than others, to generate a score for each position. But you could simply use a hamming distance, or bitwise AND the hashes and count bits, etc. The actual scoring is relatively unimportant and domain-specific
Repeat this process for all random projections. When you're done, you have a list of projections that your target approximately matches. Index these as an array of tokens
At search time, repeat the process for your candidate sequence. Search for the resulting list of tokens. Since each token represents a random projection into the sequence space (with some fuzziness depending on scoring threshold in step 2), each matching token means your sequence is approximately and semantically similar to the target in some manner (because we used a locality sensitive hash). The more overlapping tokens, the more similar they are.
Optional step five: I performed a rescore phase afterwards, which did an exact sequence alignment, just to re-order the top n results more precisely and weed out the obviously false positives. You could do something similar with a real Hamming distance, etc.

It obviously has some issues. You need "enough" random projections to cover the search space, and you can never guarantee that every candidate is fully covered. It may return spurious results which happen to hash to similar regions but are in fact very different. And the method of hashing affects how results are grouped. But it's a decent approximation, and most importantly, is very fast (since at the ES layer it's just simple token matching).

I'm unsure how your image hashing works, but if it can emit actual features instead of collapsing them to a binary hash, you could side-step the entire random-projection issue and just index those features directly.

E.g. perhaps you can grab the weights of the final output layer, discretize them into buckets (to make tokens easier to search, as opposed to floats) and index those weights directly. That only works if outputs share semantic overlap between similar inputs, I don't know enough about deep convolutional networks to say.

Hope that helps! Would be happy to talk out any potential ideas you have, even if they are entirely unrelated to this novel that I just wrote. I really love thinking about these kinds of use-cases in ES. Goodluck!

If you're curious, this is a slide-deck from a meetup about the above system. It's a little hard to grok since the slides are more speaking aids than self-descriptive, but hopefully it helps a bit:

beniz · March 1, 2016, 4:57pm

Hi @polyfractal, many thanks for having taken the time for the detailed answers. I got a good laugh from the one quote above, thanks for this as well It will take a bit of time to re-read this in detail, do some tests and share thoughts.

polyfractal · March 5, 2016, 11:25am

Happy to help!

Topic		Replies	Views
Fuzzy search question Elasticsearch	8	1239	May 23, 2020
Is there any way for doing complicated mathematical matching of indexes while retrieval? <not just simple text comparison > Elasticsearch	3	292	July 6, 2017
ElasticSearch5系での画像検索方法を知りたい日本語による質問・議論はこちら	2	1937	July 6, 2017
Term vectors for computing document similarity Elasticsearch	7	1366	July 6, 2017
Search for similar documents Elasticsearch	4	1877	July 6, 2017

What is the best scheme for similarity search based on binary codes with Elasticsearch ? (Context is image search with deep nets)

Related topics