Hello, I am doing reverse image search using Elasticsearch. I have image hashes stored in an index, and I am trying to find similar hashes (to compensate for compression artifacts and the like) using Query String fuzziness.
My search code is:
    var searchResponse = await Program._elasticclient.SearchAsync<IndexedImage>(s => s
        .Index("images")
        .Query(q => q
            .QueryString(qs => qs
                .FuzzyMaxExpansions(150)
                .Fuzziness(Fuzziness.EditDistance(150))
                .Fields(f => f.Field(ff => ff.imagehash))
                .Query(imagehash)))
        .Size(10000)
    ).ConfigureAwait(false);
The imagehash field is a string holding the hash (a number about 600 characters long).
Searching for similar strings works in some cases but not in others, and I am wondering what is wrong.
For example, this is the original hash; searching for it returns 2 results from the DB:
222230302101000014343434014133341141303411413033013140410000000304142022222214134313414240103030010130300101101003031110130314133303434110102222222243431000030024241411432324214232141143133431411144421000030034042222222214111333121143232223323211113333333331211311333344442000444340002222222201013434222032323130110111103303323011011110330334330101343111112222222231333111010244440400444434304131141343134444400044441030131133432222222233311131101423130001444401044444434400404444003041410020000034242222222201003414434300000000343430310131303201013131010102003434121143032222224203034444444410002123421201033444343421413434314101044444444230402222
This is the hash of a similar image (its Levenshtein distance from the original is 52); searching for it returns 0 results:
222230302101000024343434014133341141303411413033013140410000000204242022222214134313414140103030010130300101101003031110130314133303424120102222222243431000030034241411432324214222141143133431411144421000030034042222222214111333121243232223323212113333333331211312333344432010444340002222222201013434323021312120120211103303323011011110230333331101343111112222222231333111010244440400444434304131141443134444400044441030131133432222222233311131101423130001444401044444434400404444003041410030000034242222222201003414434300000000343430310131303201013131010101003434222142022222224203024444444410002123421211133343343421413434314101044444444230402222
When I took the original hash and replaced a run of digits with the same number of 9s, I got a Levenshtein distance of 59, yet the search returned the same 2 results as for the original. Hash:
222230302101000014343434014133341141303411413033013140410000000304142022222214134313414240103030010130300101101003031110130314133303434110102222222243431000030024241411432324214232141143133431411144421000030034042222222214111333121143232223323211113333333331211311333344442000444340002222222201013434222032323130110199999999999999999999999999999999999999999999999999999999190244440400444434304131141343134444400044441030131133432222222233311131101423130001444401044444434400404444003041410020000034242222222201003414434300000000343430310131303201013131010102003434121143032222224203034444444410002123421201033444343421413434314101044444444230402221
All hashes, both the stored ones and the ones I search with, are always the same length.
The similarity between two hashes can be checked here: https://countwordsfree.com/comparetexts
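For anyone who wants to reproduce the distances quoted above without the website, here is a minimal sketch of a standard dynamic-programming Levenshtein distance in C#. The names `HashDistance` and `Levenshtein` are made up for illustration; they are not part of my project or of the Elasticsearch client, and this is not a claim about how Elasticsearch computes fuzziness internally.

```csharp
using System;

// Hypothetical helper, only for checking distances between two hash strings locally.
public static class HashDistance
{
    // Two-row dynamic-programming Levenshtein distance:
    // the minimum number of single-character insertions, deletions,
    // and substitutions needed to turn a into b.
    public static int Levenshtein(string a, string b)
    {
        if (a == null) throw new ArgumentNullException(nameof(a));
        if (b == null) throw new ArgumentNullException(nameof(b));

        // previous[j] = distance between the first i-1 chars of a and the first j chars of b
        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];

        // Turning an empty prefix of a into the first j chars of b takes j insertions.
        for (int j = 0; j <= b.Length; j++)
        {
            previous[j] = j;
        }

        for (int i = 1; i <= a.Length; i++)
        {
            // Turning the first i chars of a into an empty string takes i deletions.
            current[0] = i;

            for (int j = 1; j <= b.Length; j++)
            {
                int substitutionCost = a[i - 1] == b[j - 1] ? 0 : 1;

                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1,           // insertion
                             previous[j] + 1),             // deletion
                    previous[j - 1] + substitutionCost);   // substitution or match
            }

            // The row just filled becomes the "previous" row for the next character of a.
            (previous, current) = (current, previous);
        }

        return previous[b.Length];
    }
}
```

Calling `HashDistance.Levenshtein(originalHash, similarHash)` with two of the hashes above (the variable names are placeholders) gives a number that can be compared against the figures from the website.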
Can anyone point me in the right direction and tell me what I am doing wrong here? Thanks in advance.