Hamming Distance on Binary Strings - Latest

Hello,

I want to do hamming distance on binary strings in elasticsearch, and I want to control the distance.
In my case, the binary strings have a length of 256 and I want everything with a hamming distance of 32 or below.

Is this possible using fuzzy searching? Looking at the docs, the fuzziness parameter for fuzzy queries only accepts values; 0, 1, 2 or AUTO.

Apparently with AUTO I can set low and high distance values, so perhaps: AUTO:0,32 would work?

Can someone confirm this? Is anyone else doing this? If so, how?

I've done some experiments, but aren't getting quite the results I expected.

I have several hashes, with varying hamming distances from my query hash, but results for the query:

{
  "query": {
    "fuzzy": {
      "hash.keyword": {
        "value": "01010011...etc...10010",
        "fuzziness": "AUTO:0,32",
        "max_expansions": 50,
        "prefix_length": 0,
        "transpositions": true
      }
    }
  }
}

only returns results where the hamming distance <= 2.

Is that expected?

Yes, that is expected. AUTO adjusts the distance based on the length of the string, but the maximum is still 2.

Thanks for the response.

Is there a reason why the limitation is 2? Is there any way to do what I want to do with elasticsearch?

I believe the reason for capping this is performance. I do not have any suggested way to do what you want so will leave that for others.

You may be able to do it using a script query or a runtime field, but I am not sure. this would have an impact on performance though, especially at scale.

I tried various scoring scripts, but they're not in the least bit performant.

OpenSearch has a hamming distance plugin. So it looks like I'll be migrating to that. :cry:

1 Like

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.