Best way to search for extremely long hash-code similarity?

SuperSimon · July 25, 2019, 10:26am

Hey!

We are a group of students using Elasticsearch to search for possible matches between hash-codes (or numeric values) stored in an index and a hash-code supplied in a query. As we are totally new to Elasticsearch, we were wondering what the best way is to search for hash-codes or very long numeric values. At the moment we store the hash-code as a string and search with the following query:

(Please tell me if you do not understand the lambda expressions/C# code.)
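// FingerprintDoc is a placeholder name for the document class, which the post does not show.
var searchResponse = _client.Search<FingerprintDoc>(s => s
    .From(0)
    .Size(10)
    .Query(qu => qu
        .Match(m => m
            .Field(f => f.Fp)       // field holding the hash-code, stored as a string
            .Query("hash-code")     // placeholder for the actual hash-code value
        )
    )
);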

This query works and we do find the right matches, but it tends to be rather slow, probably because the hash-codes are normally between 40,000 and 110,000 digits long. Besides being slow, the query has one particular problem: it exceeds the max clause count limit of 1024. Raising that limit results in extremely slow response times.
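As a rough illustration of why the 1024 limit is hit: a match query on an analyzed text field is turned into one term clause per token of the query text. The sketch below assumes the hash-code is analyzed into whitespace-separated tokens, which the post does not confirm; the function names are made up for illustration.

```python
# The clause limit mentioned above (indices.query.bool.max_clause_count).
MAX_CLAUSE_COUNT = 1024


def count_query_terms(query_text: str) -> int:
    """Approximate the number of term clauses a match query would generate.

    ASSUMPTION: the analyzer splits the query text on whitespace, so each
    whitespace-separated token becomes one clause.
    """
    return len(query_text.split())


def exceeds_clause_limit(query_text: str, limit: int = MAX_CLAUSE_COUNT) -> bool:
    """Return True if the query text would produce more clauses than the limit."""
    return count_query_terms(query_text) > limit


# Hypothetical query text made of 2000 tokens.
long_query = " ".join(["12345"] * 2000)
print(count_query_terms(long_query), exceeds_clause_limit(long_query))
```

Under that assumption, any query with more than 1024 tokens fails unless the limit is raised, and raising it means scoring thousands of clauses per query.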

Please note that at the moment we have around 200,000 fingerprints in a single index with a total size of ~50 GB. We are asking because we are hitting performance issues.
The index has 5 shards and 1 replica, and we have 30 GB of RAM with a heap size of 13 GB and a very fast 500 GB HDD.

Thanks!

Mark_Harwood · July 25, 2019, 10:35am

Hi Simon,
The initial question is why are your hashcodes so long? Hashcodes are normally much smaller.

SuperSimon · July 25, 2019, 12:53pm

@Mark_Harwood

Sorry for the confusion. Let's drop the term hash-code for a bit and go with the idea of a numeric value divided into segments, each of which represents 371 ms of audio. We take 371 ms of audio, convert it into a numeric value (the implementation being irrelevant) and concatenate these values as we progress through the song. We then store the resulting value in Elasticsearch. When, for instance, a radio station is playing, we take a segment of that broadcast, turn it into numeric values and search for them in Elasticsearch to find potential matches.
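To make the segment arithmetic concrete, here is a minimal sketch. It assumes each 371 ms segment yields one integer and that the values are stored space-separated rather than fused into one number; the function names and sample values are hypothetical, not the actual fingerprinting code.

```python
SEGMENT_MS = 371  # segment length taken from the description above


def complete_segments(duration_ms: int) -> int:
    """Number of whole 371 ms segments that fit into the given duration."""
    return duration_ms // SEGMENT_MS


def to_token_text(segment_values) -> str:
    """Join per-segment values with spaces.

    ASSUMPTION: a whitespace-based analyzer then indexes every segment
    value as its own term, with its position in the song preserved.
    """
    return " ".join(str(v) for v in segment_values)


print(complete_segments(180_000))   # a 3-minute song
print(to_token_text([17, 4, 99]))   # made-up segment values
```

A 3-minute song gives 485 complete segments, so a fingerprint of 40,000 to 110,000 digits covers a few hundred to a few thousand segments, depending on how many digits each segment value has.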

Mark_Harwood · July 25, 2019, 1:05pm

Ah OK. Shazam type deal.
I'm vaguely aware of a number of Lucene-based projects that do this (a quick Google gave me this one).

Rather than concatenating these values into a single large value, don't these Lucene-based systems use the positional queries in Lucene (e.g. span queries) to find sequences of smaller numbers?
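A positional query of that kind could look roughly like the sketch below, which builds a span_near request body as a plain Python dict (it maps one-to-one to the query DSL JSON). The field name "fp" and the segment values are assumptions, and the sketch presumes each segment value was indexed as a separate term.

```python
def span_near_query(field: str, segment_values, slop: int = 0) -> dict:
    """Build a span_near body that matches the given values as an ordered run.

    slop=0 with in_order=True requires the terms to appear consecutively
    and in the given order; a larger slop tolerates gaps between them.
    """
    clauses = [{"span_term": {field: str(v)}} for v in segment_values]
    return {
        "query": {
            "span_near": {
                "clauses": clauses,
                "slop": slop,
                "in_order": True,
            }
        }
    }


# Hypothetical window of three consecutive segment values from a radio sample.
body = span_near_query("fp", [17, 4, 99])
print(body)
```

The point of the design is that the query only needs a short window of consecutive segment values from the radio sample, not the whole fingerprint, so the number of clauses stays small.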

system · August 22, 2019, 1:17pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
