My project mainly compares the results of past queries with those of the current query. Each query returns a list (array) of IDs (alphanumeric values), and the number of IDs depends on the type of query. Because the result size varies, I am using locality-sensitive hashing (minhashing) to convert each list of IDs into a fixed-length list of hash signatures (32-bit integers).
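To illustrate the conversion, here is a minimal minhash sketch in Python. The seeded SHA-1 hash family, the function names and the masking to 31 bits (so that values fit a signed 32-bit `integer` field) are assumptions made for this example, not the project's actual implementation.

```python
import hashlib


def _hash31(seed, item):
    """Seeded hash of one ID, reduced to a non-negative 31-bit integer."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    # Take the first 4 bytes and clear the sign bit so the value fits
    # a signed 32-bit integer field.
    return int.from_bytes(digest[:4], "big") & 0x7FFFFFFF


def minhash_signature(ids, num_hashes=40):
    """Return a fixed-length signature for a variable-length list of IDs."""
    unique_ids = set(ids)
    if not unique_ids:
        raise ValueError("cannot build a signature from an empty ID list")
    # One signature slot per hash function: the minimum hash over all IDs.
    return [
        min(_hash31(seed, item) for item in unique_ids)
        for seed in range(num_hashes)
    ]


# Result lists of different sizes map to signatures of the same length.
small = minhash_signature([f"id{i}" for i in range(54)])
large = minhash_signature([f"id{i}" for i in range(200)])
print(len(small), len(large))
```

Slot `k` of every signature comes from the same hash function, which is what makes a slot-by-slot comparison between two signatures meaningful.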
For example, suppose I have three queries Q1, Q2 and Q3 that return lists of 54, 76 and 200 IDs respectively. I convert each of them into 40 hash signatures. The reason for converting to signature lists is to measure the similarity between queries signature by signature. If the current query is at least, say, 90% similar to a past query, the system will reject it.
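The signature-by-signature comparison and the rejection rule can be sketched as follows. The function names and the short sample signatures are hypothetical, and the 0.9 threshold is just the "say 90%" figure from above.

```python
def signature_similarity(sig_a, sig_b):
    """Fraction of positions at which two equal-length signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    if not sig_a:
        raise ValueError("signatures must not be empty")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def should_reject(new_sig, past_sigs, threshold=0.9):
    """Reject the new query if any past signature is at least `threshold` similar."""
    return any(
        signature_similarity(new_sig, past) >= threshold for past in past_sigs
    )


# Tiny made-up signatures (length 4 instead of 40) to show the idea.
past = [[11, 22, 33, 44], [55, 66, 77, 88]]
print(signature_similarity([11, 22, 33, 99], past[0]))  # 3 of 4 positions agree
print(should_reject([11, 22, 33, 99], past))
```

This in-memory loop compares the new signature against every stored one, which is the part I would like Elasticsearch to take over once there are hundreds of thousands of past queries.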
Since the number of queries is expected to be in the hundreds of thousands, Elasticsearch seems like an effective option for storing these signatures and computing the similarity value. The mapping I am currently considering is (where signs is an array of integers):
{
  "mappings": {
    "properties": {
      "signs": {
        "type": "integer"
      }
    }
  }
}
For example, suppose the signatures of these two old queries are stored in ES:
PUT old/signatures/1
{
"signs" : [2140511, 44805737, 127503063, 60153239, 117800107, 59420857]
}
PUT old/signatures/2
{
"signs" : [60153239, 117800107, 59420857, 7731079, 91755054, 22981500]
}
Now I get a new query whose signatures are [53442323, 44805737, 127503063, 60153239, 117800107, 59420857]. Five of its six signatures match the query stored under id=1. Using the matching score generated by ES, I could then decide whether to accept or reject the query.
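One way the request might be expressed is a `bool` query with one `term` clause per signature, each wrapped in `constant_score` so that the `_score` should equal the number of matching clauses. The sketch below only builds the request body in Python and does not contact a cluster. The helper name and the `min_matches` parameter are hypothetical, and I have not verified this against a particular Elasticsearch version.

```python
import json


def build_signature_query(signs, min_matches, field="signs"):
    """Build a search body that counts how many signature values a document shares."""
    clauses = [
        # constant_score gives each matching clause a fixed score of 1.0,
        # so the summed score is intended to be the number of shared values.
        {"constant_score": {"filter": {"term": {field: value}}, "boost": 1.0}}
        for value in signs
    ]
    return {
        "query": {
            "bool": {
                "should": clauses,
                # Only return documents sharing at least this many values.
                "minimum_should_match": min_matches,
            }
        }
    }


new_signs = [53442323, 44805737, 127503063, 60153239, 117800107, 59420857]
body = build_signature_query(new_signs, min_matches=5)
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the body of a search request against the `old` index. One caveat: as far as I understand, `term` clauses on an array field test whether a value is present anywhere in the array, so this counts shared values regardless of position. The document with id=2 shares three values with the new query even though none of them sit at the same position.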
Is this array matching possible in ES?