Alternative to Fuzzy Search for pHash

Hello,
I'm calculating image signatures as pHashes, which look like this:
111000100110011111110001000000000101001101100001001010101111101110100001000100110111000101010010011011111000101100100010111011110110010111001100011101000110001011100001100011010000100100000100110101111011001110001001111001100
Storing this as a string and running a fuzzy_like_this_field query works just fine:

{
   "query": {
      "fuzzy_like_this_field": {
         "phash": {
            "like_text": "011000100110011111110001000000000101001101100001001010101111101110100001000100110111000101010010011011111000101100100010111011110110010111001100011101000110001011100001100011010000100100000100110101111011001110001001111001100",
            "fuzziness": 0.85
         }
      }
   }
}

I'm getting results as desired: closer zero/one sequences are returned first.
Recently I found out that fuzzy_like_this_field has been removed in 2.0, so I'm trying to use a match query with the fuzziness attribute as suggested, but it doesn't work the same way, maybe because I'm still running the match query on 1.7:

{
   "query": {
      "match": {
         "phash": {
            "query": "111000100110011111110001000000000101001101100001001010101111101110100001000100110111000101010010011011111000101100100010111011110111010111001100011101000110001011100001100011010000100100010100110101111011001110001001111001100",
            "fuzziness": "0.85"
         }
      }
   }
}

First of all, if anybody knows why fuzziness works differently here, please let me know. But most importantly, I've read some articles about fuzzy queries in Lucene, and they say that the more distinct phash values I have, the slower the query will be. I have dozens of millions of pictures and am expecting more, so it looks like I could end up with a very slow query.
Is there any way to retrieve documents with a similar field value using some other query that wouldn't degrade performance as much?

Thank you,
Eugene

Hi Eugene,

I think a better way to solve this problem would be to use n-grams. That way Elasticsearch would return the documents that share the most sub-strings with your query.
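
Something along these lines, just a sketch (the gram size, analyzer name, and type name are placeholders): create the index with a custom nGram tokenizer and map the phash field to use it.

{
   "settings": {
      "analysis": {
         "tokenizer": {
            "phash_ngram": {
               "type": "nGram",
               "min_gram": 8,
               "max_gram": 8
            }
         },
         "analyzer": {
            "phash_analyzer": {
               "type": "custom",
               "tokenizer": "phash_ngram"
            }
         }
      }
   },
   "mappings": {
      "image": {
         "properties": {
            "phash": {
               "type": "string",
               "analyzer": "phash_analyzer"
            }
         }
      }
   }
}

Then a regular match query on phash would rank documents by how many grams they share with the query string.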

Adrien is right, n-grams come to mind first for me too. The fuzzy queries have their limits.

In addition to n-grams, I'll just point out that you can turn just about anything into tokens. As an example, here's something crazy/stupid I did out of complete ignorance of how image search should actually work: it just parses an image and emits tokens based on the average RGB of every n-th pixel. I quickly ran into the maximum number of boolean clauses you can use in a search, so I absolutely wouldn't do this in production. But maybe with something like pHash it becomes an option.

Thanks for the tip. I'm trying n-grams right now. Looks promising.
Also, as you can see, the pHash is just a 256-bit binary value. Another alternative I was thinking about is to index it not as a string of zeros and ones but as a binary number, to save space.
But I couldn't find a way to run searches that compare binaries.
Technically I'd need to apply an XOR operation, and the more zeros the result has, the higher the score should be.
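
For example, I imagine something like a function_score script could do it, assuming the hash were indexed as a numeric long field (say phash_bits, which only holds 64 bits, so my 256-bit hash would have to be split across four such fields) and dynamic scripting were enabled. Untested, just what I have in mind:

{
   "query": {
      "function_score": {
         "query": { "match_all": {} },
         "script_score": {
            "script": "64 - Long.bitCount(doc['phash_bits'].value ^ query_hash)",
            "params": {
               "query_hash": 81985529216486895
            }
         }
      }
   }
}

The score would be the number of matching bits: 64 minus the popcount of the XOR.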
I guess this is already too much to ask. But since Doug said that we can tokenize anything, why not?
If you have any idea on this regard, please let me know.
Thanks,
Eugene

Basically, under the hood Lucene works really well with either prefix queries or exact term queries. In fact, if you read more about the fuzzy query, it's trying to expand your fuzzy query into many terms that are N edits away from your query string. It's also trying to prod you into having a shared prefix.
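
For illustration, the plain fuzzy query exposes both of those knobs; prefix_length pins the first N characters so Lucene only has to expand the tail. (The value here is shortened and made up, and 2 is the maximum edit distance Lucene supports.)

{
   "query": {
      "fuzzy": {
         "phash": {
            "value": "1110001001100111",
            "fuzziness": 2,
            "prefix_length": 8
         }
      }
   }
}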

So if you present Lucene with the right tokens at index/search time for either exact or prefix matching, you can have really tight control over the vector similarity measurement that happens. You can control that either through what I did in that image-search hack job (turn pixels into words that can be consumed as tokens) or by controlling the analysis process within Lucene. This article might help you get some context. So does Chapter 4 of my book.

Hope that helps!

Thanks,
nGrams didn't work for me, so I ended up using your approach: tokenizing my long string of zeros and ones into boolean fields and using a bool/should query to retrieve the results. The score shows how close a given pHash is to the indexed ones.
The only problem I had is that there is no way to match against a specific position of a multivalued field, so I had to create one field per bit, like this: {pos0: true, pos1: false, ... pos255: true}. Now I can run a bool should query:

"query": {
      "bool": {
         "should": [
            {
               "term": {
                  "pos0": true
               }
            },
            {
               "term": {
                  "pos1": true
               }
            },
--- SNIP ---
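
For reference, the mapping behind this is just 256 boolean fields, generated programmatically (the type name image is just what I happen to use):

"mappings": {
   "image": {
      "properties": {
         "pos0": { "type": "boolean" },
         "pos1": { "type": "boolean" },
--- SNIP ---
         "pos255": { "type": "boolean" }
      }
   }
}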

I wish I could specify at which position of the multivalued field the match should happen, but it looks like that isn't possible for now.
Anyway, it works fine.
Thanks for the ideas.
Eugene