Problem Computing Euclidean Distance using script score query


(Vaibhav Thapliyal) #1

Hi Everyone,

I am trying to build reverse image search (image similarity) functionality using Elasticsearch. I have indexed the feature vectors in Elasticsearch as an array, which looks something like this:

"feature_vector" : [157, 144, 26, 107, 97, 62, 114, 248 ........ ]

The size of this array is 256.

Now I am trying to run a Euclidean Distance formula as a script.

Here's the formula I am trying to implement:

[Image: the Euclidean distance formula, d(p, q) = sqrt(sum over i of (p_i - q_i)^2)]
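For reference, here is a minimal Python sketch of the Euclidean distance named above. The function name `euclidean_distance` and the toy 4-dimensional vectors are illustrative assumptions, not taken from the index.

```python
import math

def euclidean_distance(p, q):
    """Return sqrt(sum((p_i - q_i)^2)) for two equal-length sequences."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical toy vectors, only to show the call shape.
stored = [157, 144, 26, 107]
query = [170, 134, 191, 75]
print(euclidean_distance(stored, query))
```

The result depends on pairing element i of one vector with element i of the other, so both vectors must be in the same order.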

Here's the script:

GET images_features/_search
{
  "sort": [
    {
      "_score": {
        "order": "asc"
      }
    }
  ],
  "query": {
    "function_score": {
      "script_score": {
        "script": {
          "lang": "painless",
          "source": "double distance = 0; double diff = 0; if(doc['feature_vector'].size() != params.query_feature.size()){distance = 0} else{for(int j = 0; j < doc['feature_vector'].size(); j++){diff = Math.abs(doc['feature_vector'][j]) - Math.abs(params.query_feature[j]); distance = distance + (diff*diff)}} return Math.sqrt(distance)",
          "params": {
            "query_feature": [
              170,
              134,
              191,
              75,
              139,
               .
               .
               .
              180,
              232,
              150,
              182,
              208,
              239,
              109,
              232,
              106
            ]
          }
        }
      }
    }
  }
}

However, the computed score is wrong, and as a result the returned ranking is not meaningful. This script gives a score of 2281.6973, while my Java and Python programs return 64.093681.

I verified the expected value by comparing an input feature vector with a feature vector stored in Elasticsearch, both in Python (using the SciPy library) and in Java (by writing the same script as a Java program), and the two results match.

Is there an issue in the script that I am missing?
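One detail of the script can be checked in isolation: it computes the per-element difference as `Math.abs(a) - Math.abs(b)` rather than `a - b`. The sketch below (helper names are hypothetical) suggests that the two agree for non-negative values such as the ones shown here, so this alone would not explain the mismatch, but they diverge once negative components appear.

```python
def script_diff(a, b):
    # Mirrors the expression used in the Painless script: |a| - |b|
    return abs(a) - abs(b)

def euclidean_diff(a, b):
    # The difference used by the Euclidean distance formula: a - b
    return a - b

# Non-negative inputs: both expressions give the same result.
print(script_diff(157, 170), euclidean_diff(157, 170))

# Mixed signs: the script's expression loses the sign information.
print(script_diff(-3, 3), euclidean_diff(-3, 3))
```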


(Vaibhav Thapliyal) #2

Hi,

I am referring this link as part of designing the overall system.

https://www.linkedin.com/pulse/hacking-elasticsearch-image-retrieval-ashwin-saval

Any help would be appreciated.


(Ryan Ernst) #3

Multiple values for doc values are stored sorted, so your order is changing, and I would expect this causes catastrophic changes to the score.
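To illustrate the effect of sorted storage, here is a small Python sketch. It uses the first eight values of the `feature_vector` from the original post and a made-up query vector (an assumption for illustration), and compares the distance against the vector as indexed with the distance against the same values in ascending order, which is how a script would see them through `doc`.

```python
import math

def euclidean_distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# First eight values of the stored vector shown in the original post.
stored = [157, 144, 26, 107, 97, 62, 114, 248]
# Hypothetical query vector of the same length.
query = [170, 134, 191, 75, 139, 180, 232, 150]

# Distance with the original element order (what a client-side program computes).
as_indexed = euclidean_distance(stored, query)

# Distance when the stored values come back sorted, as with numeric doc values.
as_doc_values = euclidean_distance(sorted(stored), query)

print(as_indexed, as_doc_values)
```

The two numbers differ because sorting pairs each query element with a different stored element.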

You will likely need to store your vector using a binary field. Note, however, that there is no documentation for accessing this from doc values. IIRC, you will need to write an advanced script (i.e. a script in Java) so you can access the necessary Lucene doc values classes. Take a look at how binary values are encoded in BinaryFieldMapper.CustomBinaryDocValuesField. This is the class that Elasticsearch uses to encode multiple values in a binary field (which Lucene does not natively support). Whether you pass in an opaque single binary value or separate values, you will need to decode the number-of-values header in the value.
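On the client side, a vector could be packed into a single opaque value before indexing, since a binary field takes a Base64-encoded string. The sketch below is only one possible layout (big-endian float32 values, with hypothetical helper names `encode_vector` and `decode_vector`). It is not the internal encoding used by CustomBinaryDocValuesField, and a script reading the doc value would still have to handle the header mentioned above.

```python
import base64
import struct

def encode_vector(values):
    """Pack floats as big-endian float32 and return a Base64 string."""
    raw = struct.pack(f">{len(values)}f", *values)
    return base64.b64encode(raw).decode("ascii")

def decode_vector(encoded):
    """Inverse of encode_vector: Base64 string back to a list of floats."""
    raw = base64.b64decode(encoded)
    count = len(raw) // 4  # 4 bytes per float32
    return list(struct.unpack(f">{count}f", raw))

encoded = encode_vector([157.0, 144.0, 26.0, 107.0])
print(encoded)
print(decode_vector(encoded))
```

Because the whole vector is a single value, there is nothing for the doc values layer to reorder.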


(Hanish Bansal) #4

@rjernst Is there any way to store multiple values of a document field without sorting? As per our understanding, multiple values are stored as an array type and preserve their order.


(Ryan Ernst) #5

No, this is not configurable. Lucene does not have any native unsorted, multi-valued numerics. When you say "multiple values are stored as array type and preserve the order", I think you may be confusing doc values (which is how values are accessed through doc in scripts) with _source, which is the raw input JSON and is kept as is.


(Vaibhav Thapliyal) #6

Hi Ryan,
Is there any other way that I can access this _source field in my script query or is this available only in the update scripts?


(Vaibhav Thapliyal) #7

I have devised a possible workaround for this problem that is working for me.

Instead of storing the values as arrays, I am now storing them as comma-separated text in a keyword field. I have changed the query parameter in the same way. Both look something like this:

4.0,96.0,159.0,120.0,234.0,240.0,180.0,200.0,238.0,211.0,160.0,176.0,213.0,254.0,109.0,42.0,135.0,113.0,234.0,19.0,23.0,103.0,174.0,64.0,73.0,131.0,79.0,63.0,196.0,113.0,175.0,130.0,26.0,170.0,142.0,184.0,10.0,209.0,152.0,202.0,134.0,244.0,144.0,171.0,173.0,238.0,226.0,130.0,79.0,251.0,239.0,133.0,14.0,75.0,206.0,49.0,154.0,67.0,63.0,185.0,117.0,140.0,45.0,203.0,114.0,32.0,248.0,100.0,75.0,89.0,208.0,72.0,99.0,248.0,180.0,42.0,19.0,60.0,86.0,50.0,43.0,213.0,252.0,129.0,13.0,99.0,204.0,80.0,155.0,217.0,5.0,183.0,101.0,32.0,78.0,7.0,114.0,34.0,229.0,69.0,67.0,88.0,105.0,32.0,41.0,164.0,178.0,141.0,75.0,39.0,212.0,214.0,34.0,196.0,237.0,140.0,8.0,66.0,229.0,192.0,159.0,250.0,207.0,147.0,39.0,164.0,86.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0

And I have tweaked my query to:

GET images_features/_search
{
  "sort": [
    {
      "_score": {
        "order": "asc"
      }
    }
  ],
  "query": {
    "function_score": {
      "script_score": {
        "script": {
          "lang": "painless",
          "source": "double distance = 0; double diff = 0; String feature_vector = doc['feature_vector'].value; String []feature_vectors = /,/.split(feature_vector); String []input_vectors = /,/.split(params.query_feature); for(int j = 0; j < feature_vectors.length; j++){diff = Math.abs(Double.parseDouble(feature_vectors[j])) - Math.abs(Double.parseDouble(input_vectors[j])); distance = distance + (diff*diff)} return Math.sqrt(distance)",
          "params": {
            "query_feature": "4.0,96.0,222.0,120.0,234.0,240.0,180.0,200.0,238.0,211.0,160.0,176.0,213.0,254.0,109.0,42.0,135.0,113.0,234.0,19.0,23.0,103.0,174.0,64.0,73.0,131.0,79.0,63.0,196.0,113.0,175.0,134.0,26.0,170.0,142.0,184.0,10.0,145.0,152.0,202.0,134.0,244.0,144.0,171.0,173.0,238.0,226.0,130.0,79.0,251.0,239.0,133.0,14.0,75.0,206.0,49.0,154.0,67.0,47.0,185.0,117.0,140.0,45.0,195.0,114.0,32.0,248.0,100.0,75.0,89.0,208.0,72.0,99.0,248.0,180.0,42.0,19.0,28.0,86.0,50.0,43.0,213.0,252.0,129.0,13.0,99.0,204.0,80.0,155.0,217.0,5.0,183.0,101.0,32.0,78.0,7.0,114.0,34.0,229.0,69.0,67.0,88.0,105.0,0.0,41.0,164.0,178.0,141.0,75.0,39.0,212.0,214.0,34.0,196.0,237.0,140.0,8.0,66.0,229.0,192.0,159.0,250.0,207.0,147.0,39.0,164.0,86.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0"
          }
        }
      }
    }
  }
}

This is currently working for me.
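The logic of this workaround can be sketched outside Elasticsearch as follows. This is a Python illustration with hypothetical helper names, not the Painless script itself. Unlike the script, it also checks that both vectors have the same length, which seems worth doing before indexing into the query vector.

```python
import math

def parse_vector(text):
    """Split a comma-separated string into a list of floats."""
    return [float(x) for x in text.split(",")]

def distance_from_strings(stored_text, query_text):
    stored = parse_vector(stored_text)
    query = parse_vector(query_text)
    if len(stored) != len(query):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(stored, query)))

# Short hypothetical strings in the same format as the keyword field.
print(distance_from_strings("4.0,96.0,159.0", "4.0,96.0,222.0"))
```

Since the whole vector is one string value, element order survives, but both strings are split and parsed for every document that is scored.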


(Ryan Ernst) #8

Sure, that works, but it means you will incur the cost of parsing on every document.

