Matching integers in an array

My project compares the results of past queries with the current query. Each query returns a list (array) of alphanumeric IDs, and the number of IDs depends on the type of query. Because the result size varies, I am using locality-sensitive hashing (minhashing) to convert each list of IDs into a fixed-length list of hash signatures (32-bit integers).

For example, suppose I have three queries Q1, Q2 and Q3 that return lists of 54, 76 and 200 IDs respectively. I convert each of them into 40 hash signatures. The reason for converting to hash lists is to measure the similarity between queries signature by signature. If the current query is, say, 90% similar to a past query, the system will reject it.
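To make the setup concrete, here is a minimal sketch of how ID lists of different sizes could be reduced to 40 signatures each and compared position by position. The seeded hash built on BLAKE2b and the names `minhash_signature` and `signature_similarity` are assumptions for illustration only, not the hash functions actually used in my project.

```python
import hashlib

NUM_HASHES = 40  # signature length used in the example above


def _hash32(value: str, seed: int) -> int:
    """Hypothetical seeded 32-bit hash (assumption: BLAKE2b with the seed as salt)."""
    digest = hashlib.blake2b(
        value.encode("utf-8"),
        digest_size=4,                    # 4 bytes -> 32-bit value
        salt=seed.to_bytes(8, "little"),  # a different salt per hash function h_i
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(ids, num_hashes=NUM_HASHES):
    """Return one minimum hash value per hash function h_1..h_n."""
    unique_ids = set(ids)
    if not unique_ids:
        raise ValueError("cannot build a signature from an empty ID list")
    return [min(_hash32(i, seed) for i in unique_ids) for seed in range(num_hashes)]


def signature_similarity(sig_a, sig_b):
    """Fraction of positions at which two equal-length signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Position i of every signature comes from the same hash function, which is why the comparison is done index by index rather than as a set intersection.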

Since the number of queries is expected to be in the hundreds of thousands, Elasticsearch seems like an effective option for storing these signatures and computing the similarity. The mapping I am currently considering is (where signs is an array of integers):

{
  "mappings": {
    "properties": {
      "signs": {
        "type": "integer"
      }
    }
  }
}

For example, I have these two old query signatures stored in ES:

PUT old/signatures/1
{
"signs" : [2140511, 44805737, 127503063, 60153239, 117800107, 59420857]
}

PUT old/signatures/2
{
"signs" : [60153239, 117800107, 59420857, 7731079, 91755054, 22981500]
}

And I get a new query whose signatures are [53442323, 44805737, 127503063, 60153239, 117800107, 59420857]. This new query has 5 matching signatures with the query stored in id=1. So with the matching score generated by ES, I can decide whether to accept or reject the query.

Is this array matching possible in ES?
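The match counts in this example can be checked outside ES with a few lines of Python. This is only a sketch of the comparison I have in mind; the function names `positional_matches` and `unordered_matches` are made up for illustration.

```python
def positional_matches(sig_a, sig_b):
    """Count positions where both signatures hold the same value."""
    return sum(a == b for a, b in zip(sig_a, sig_b))


def unordered_matches(sig_a, sig_b):
    """Count distinct values shared by both signatures, ignoring position."""
    return len(set(sig_a) & set(sig_b))


old_1 = [2140511, 44805737, 127503063, 60153239, 117800107, 59420857]
old_2 = [60153239, 117800107, 59420857, 7731079, 91755054, 22981500]
new = [53442323, 44805737, 127503063, 60153239, 117800107, 59420857]

print(positional_matches(new, old_1), unordered_matches(new, old_1))  # 5 5
print(positional_matches(new, old_2), unordered_matches(new, old_2))  # 0 3
```

Against id=2 the two counts differ: three values are shared, but none of them sit at the same position.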

First, I would advise mapping these numbers as keyword rather than integer. The reasoning is that numeric types in Elasticsearch are optimized for range queries and are a bit slower at exact queries than keyword fields. I suspect you will only be running exact queries on these numbers, so keyword is a better fit.
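As a rough sketch, the keyword mapping could be built as a Python dict and sent with whatever client you use. The helper name `keyword_mapping` is hypothetical; only the mapping structure itself mirrors the one posted above, with the type swapped.

```python
def keyword_mapping(field_names):
    """Build an index body that maps every given field as `keyword`."""
    return {
        "mappings": {
            "properties": {name: {"type": "keyword"} for name in field_names}
        }
    }


# Single array field, as in the original mapping
body = keyword_mapping(["signs"])
```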

Is the order important?

If it is not, then you could just run a bool query on the signs field:

GET index/_search
{
  "query": {
    "bool": {
      "should": [
        { "constant_score": { "match": { "signs": 53442323 } } },
        { "constant_score": { "match": { "signs": 44805737 } } },
        { "constant_score": { "match": { "signs": 127503063 } } },
        { "constant_score": { "match": { "signs": 60153239 } } },
        { "constant_score": { "match": { "signs": 117800107 } } },
        { "constant_score": { "match": { "signs": 59420857 } } }
      ]
    }
  }
}

If the order is important, you might want to give each coordinate a different field name or value prefix so that matching cannot happen across different indices of your array.

Thanks Adrien for your prompt reply. Yes, the order is important because each signature is the result of a specific hash function, say h_i. So if I have 40 signatures s1, s2, ..., s40, they come from hash functions h1, h2, ..., h40.

That is why initially I was thinking about a list of integers so that each position/index is compared. If I understood correctly the second part of your reply (order is important), should I store each signature in a different field like this:

PUT old/signatures/1
{
  "sign1": 2140511,
  "sign2": 44805737,
  "sign3": 127503063,
  "sign4": 60153239,
  "sign5": 117800107,
  "sign6": 59420857
}

And then use the same bool query as you mentioned above?
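For 40 signatures it is easier to generate the per-position document than to write it by hand. A minimal sketch, assuming the field names follow the sign1, sign2, ... pattern shown above (the helper `signature_to_doc` is hypothetical):

```python
def signature_to_doc(signature, prefix="sign"):
    """Map each signature to its own field: position i (1-based) -> '<prefix>i'."""
    return {f"{prefix}{i}": value for i, value in enumerate(signature, start=1)}


doc = signature_to_doc([2140511, 44805737, 127503063, 60153239, 117800107, 59420857])
# doc["sign1"] holds the value from h1, doc["sign2"] the value from h2, and so on
```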

Correct.

You can even configure a minimum_should_match if you want at least a certain number of matches. https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-minimum-should-match.html

GET index/_search
{
  "query": {
    "bool": {
      "should": [
        { "constant_score": { "match": { "sign1": 53442323 } } },
        { "constant_score": { "match": { "sign2": 44805737 } } },
        { "constant_score": { "match": { "sign3": 127503063 } } },
        { "constant_score": { "match": { "sign4": 60153239 } } },
        { "constant_score": { "match": { "sign5": 117800107 } } },
        { "constant_score": { "match": { "sign6": 59420857 } } }
      ]
    }
  }
}

I tried running the above query and got this error:

RequestError: TransportError(400, 'parsing_exception', '[constant_score] query does not support [match]')

I am using the Python client for ES:

https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.Elasticsearch.search

Am I doing something wrong?
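For anyone hitting the same parsing_exception: as far as I can tell, constant_score does not accept a query type such as match directly; it expects the inner query under a filter key. Below is a minimal sketch that builds the request body that way, using term clauses for exact matching and minimum_should_match for the threshold. The helper `build_similarity_query` and its `min_matches` parameter are hypothetical names; the resulting dict would be passed as the search body to the client.

```python
def build_similarity_query(signature, min_matches, prefix="sign"):
    """Build a bool query with one constant-score clause per signature position."""
    should = [
        # constant_score wraps its inner query under "filter"
        {"constant_score": {"filter": {"term": {f"{prefix}{i}": value}}}}
        for i, value in enumerate(signature, start=1)
    ]
    return {
        "query": {
            "bool": {
                "should": should,
                # only return documents matching at least this many positions
                "minimum_should_match": min_matches,
            }
        }
    }


new_signature = [53442323, 44805737, 127503063, 60153239, 117800107, 59420857]
query = build_similarity_query(new_signature, min_matches=5)
```

With the default boost, each matching clause should contribute a score of 1, so the _score of a hit can be read as the number of matching positions. For 40 signatures and a 90% threshold, min_matches would be 36.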
