Best way to search for extremely long hash-code similarity?


We are a group of students using ElasticSearch to find possible matches between hash-codes (or numeric values) stored in an index and a hash-code supplied in a query. As we are completely new to ElasticSearch, we are wondering what the best way is to search for hash-codes or very long numeric values. At the moment we store the hash-code as a string and search with the following query:

(Please tell me if you do not understand the lambda expressions/C# code.)
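var searchResponse = _client.Search(s => s
    .Query(qu => qu
        .Match(m => m
            .Field(f => f.Fp)   // field holding the stored hash-code
            .Query(queryFp)     // queryFp: assumed name for the hash-code being searched for
        )
    )
);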

This query works and we do find the right matches, but it tends to be rather slow, probably because the hash-codes are normally between 40,000 and 110,000 digits long. Besides being slow, the query has one particular problem: it exceeds the maxClauseCount limit of 1024. Raising that limit results in extremely slow response times.
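
To make the clause limit concrete: a match query on an analyzed text field ends up, roughly, as a boolean query with one term clause per token of the query text. The Python sketch below only counts tokens. It assumes the fingerprint is sent as whitespace-separated numeric tokens (the post does not say how the field is analyzed), and `estimate_clauses` and `exceeds_limit` are hypothetical helpers, not part of any ElasticSearch client.

```python
# The limit of 1024 clauses mentioned in the post.
MAX_CLAUSE_COUNT = 1024


def estimate_clauses(query_text: str) -> int:
    """Rough estimate: one term clause per whitespace-separated token."""
    return len(query_text.split())


def exceeds_limit(query_text: str, limit: int = MAX_CLAUSE_COUNT) -> bool:
    """True if a match query on this text would likely go over the limit."""
    return estimate_clauses(query_text) > limit


if __name__ == "__main__":
    # Hypothetical query fingerprint made of 2,000 small numeric tokens.
    query = " ".join(str(i % 997) for i in range(2000))
    print(estimate_clauses(query), exceeds_limit(query))
```

If that assumption holds, a query built from a whole song goes over the limit easily, while a short excerpt stays well below it.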

Please note that we currently have around 200,000 fingerprints in a single index with a total size of ~50 GB. We are asking because we are hitting performance issues.
The index has 5 shards and 1 replica; we have 30 GB of RAM with a heap size of 13 GB and a very fast 500 GB HDD.


Hi Simon,
The initial question is: why are your hashcodes so long? Hashcodes are normally much smaller.


Sorry for the confusion. Let's drop the term hash-code for a bit and go along with the idea of a numeric value divided into segments, each of which represents 371 ms of audio. We take 371 ms of audio, convert it into a numeric value (the implementation being irrelevant), and concatenate these values as we progress through the song. We then store the result in ElasticSearch. When, for instance, a radio station plays, we take a segment of that broadcast, turn it into numeric values, and search for it in ElasticSearch to find potential matches.
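
As a rough illustration of that pipeline, here is a minimal Python sketch. The real audio-to-number step is not described here, so `segment_value` is a stand-in that simply takes a CRC32 of the raw samples; the function names, the 16-bit sample format, and the sample rate are all assumptions made for the example.

```python
import zlib
from typing import List, Sequence

SEGMENT_MS = 371  # segment length from the description above


def segment_value(samples: Sequence[int]) -> int:
    """Placeholder for the real conversion: CRC32 of the 16-bit samples."""
    data = b"".join(int(s).to_bytes(2, "little", signed=True) for s in samples)
    return zlib.crc32(data)


def fingerprint_tokens(samples: Sequence[int], sample_rate: int) -> List[str]:
    """Split the signal into 371 ms segments, one numeric token per segment."""
    seg_len = sample_rate * SEGMENT_MS // 1000
    tokens = []
    # Only full segments are used; a trailing partial segment is dropped.
    for start in range(0, len(samples) - seg_len + 1, seg_len):
        tokens.append(str(segment_value(samples[start:start + seg_len])))
    return tokens


def concatenated(tokens: List[str]) -> str:
    """The stored form described above: all segment values joined together."""
    return "".join(tokens)
```

Keeping the per-segment tokens around, rather than only the concatenated string, is what makes it possible to search for a sequence of segments later on.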

Ah OK. Shazam type deal.
I'm vaguely aware of a number of Lucene-based projects that do this (a quick Google gave me this one).

Rather than concatenating these values into a single large value, don't these Lucene-based systems use Lucene's positional queries (e.g. span queries) to find sequences of smaller numbers?
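
To show what I mean, here is a minimal Python sketch that builds the JSON body of a `span_near` query from a list of per-segment values. It assumes each segment value is indexed as its own token in a text field (for example with a whitespace analyzer) and that the field is called `fp`; both are guesses on my part, and `span_near_query` is just a hypothetical helper.

```python
import json
from typing import Dict, List


def span_near_query(field: str, tokens: List[str], slop: int = 0) -> Dict:
    """Build a request body matching the tokens as an ordered sequence."""
    clauses = [{"span_term": {field: tok}} for tok in tokens]
    return {
        "query": {
            "span_near": {
                "clauses": clauses,
                "slop": slop,      # 0 = tokens must be adjacent
                "in_order": True,  # tokens must appear in the given order
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical segment values taken from a short radio excerpt.
    body = span_near_query("fp", ["48213", "907", "15532"])
    print(json.dumps(body, indent=2))
```

A 10-second excerpt is only about 27 segments of 371 ms, so a query like this stays small no matter how long the stored songs are. Raising `slop` would tolerate a few missing or extra segments, at the cost of looser matches.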