Problem Computing Euclidean Distance using script score query

Vaibhav_Thapliyal · September 17, 2018, 11:40am

Hi Everyone,

I am trying to build a reverse image/Image Similarity functionality using Elasticsearch. I have successfully indexed the feature vectors in Elasticsearch as an array which looks something like this:

"feature_vector" : [157, 144, 26, 107, 97, 62, 114, 248 ........ ]

The size of this array is 256.

Now I am trying to run a Euclidean Distance formula as a script.

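Here's the formula I am trying to implement (the standard Euclidean distance between a stored vector p and a query vector q of length n):

$$ d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} $$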

Here's the script:

GET images_features/_search
{
  "sort": [
    {
      "_score": {
        "order": "asc"
      }
    }
  ],
  "query": {
    "function_score": {
      "script_score": {
        "script": {
          "lang": "painless",
          "source": "double distance = 0; double diff = 0; if(doc['feature_vector'].size() != params.query_feature.size()){distance = 0} else{for(int j = 0; j < doc['feature_vector'].size(); j++){diff = Math.abs(doc['feature_vector'][j]) - Math.abs(params.query_feature[j]); distance = distance + (diff*diff)}} return Math.sqrt(distance)",
          "params": {
            "query_feature": [
              170,
              134,
              191,
              75,
              139,
               .
               .
               .
              180,
              232,
              150,
              182,
              208,
              239,
              109,
              232,
              106
            ]
          }
        }
      }
    }
  }
}

However, the score being calculated is wrong, so the ranking of the returned results is not meaningful. Running this script gives a score of 2281.6973, while my Java and Python programs return 64.093681 for the same pair of vectors.

I have verified the correct distance between an input feature and a feature stored in Elasticsearch in both Python (using the scipy library) and Java (by writing the same script as a Java program), and the two results match.

Is there an issue in the script that I am missing?

Vaibhav_Thapliyal · September 17, 2018, 12:50pm

Hi,

I am referring to this link as part of designing the overall system.

https://www.linkedin.com/pulse/hacking-elasticsearch-image-retrieval-ashwin-saval

Any help would be appreciated.

rjernst · September 17, 2018, 5:35pm

Multiple values for doc values are stored sorted, so your order is changing, and I would expect this causes catastrophic changes to the score.
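To make the effect concrete, here is a minimal Python sketch (not Elasticsearch code) that simulates what the script sees. It assumes that reading a multi-valued numeric field through doc returns the values in ascending order, as described above. The stored values are the first eight entries of the example vector from the question; the query vector is made up for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# First eight values of the indexed example vector.
stored = [157, 144, 26, 107, 97, 62, 114, 248]
# Hypothetical query vector, for illustration only.
query = [150, 140, 30, 100, 90, 60, 120, 240]

# What the script was meant to compute: original order preserved.
expected = euclidean(stored, query)

# What a script reading sorted doc values would compute instead.
as_doc_values = sorted(stored)
actual = euclidean(as_doc_values, query)

print(f"intended distance: {expected:.4f}")
print(f"distance on sorted values: {actual:.4f}")
```

Because the components no longer line up position by position, the second number has no relation to the true distance, which is consistent with the mismatch reported in the question.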

You will likely need to store your vector using a binary field. Note, however, that there is no documentation for accessing this from doc values. IIRC, you will need to write an advanced script (i.e. a script in Java) so you can access the necessary Lucene doc values classes. Take a look at how binary values are encoded in BinaryFieldMapper.CustomBinaryDocValuesField. This is the class that Elasticsearch uses to encode multiple values in a binary field (which Lucene does not natively support). Whether you pass in an opaque single binary value or separate values, you will need to decode the number-of-values header in the value.
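As a rough illustration of the "opaque single binary value" idea, the Python sketch below packs a vector into bytes and Base64-encodes it, which is the form a binary field accepts at index time. The layout used here (a 4-byte big-endian count followed by float32 values) and the helper names are assumptions made for this example. It is not the encoding used by CustomBinaryDocValuesField, and it does not cover the Java script needed to read the value back inside Elasticsearch.

```python
import base64
import struct

def encode_vector(values):
    """Pack a vector as [uint32 count][float32 ...] and Base64-encode it.

    Hypothetical layout chosen for this sketch only.
    """
    header = struct.pack(">I", len(values))
    body = struct.pack(f">{len(values)}f", *values)
    return base64.b64encode(header + body).decode("ascii")

def decode_vector(encoded):
    """Reverse encode_vector; element order is preserved exactly."""
    raw = base64.b64decode(encoded)
    (count,) = struct.unpack_from(">I", raw, 0)
    return list(struct.unpack_from(f">{count}f", raw, 4))

vector = [157.0, 144.0, 26.0, 107.0, 97.0, 62.0, 114.0, 248.0]
encoded = encode_vector(vector)
decoded = decode_vector(encoded)

print(encoded)
print(decoded)
```

The point of the sketch is only that a single packed value keeps the components in their original positions, unlike a multi-valued numeric field.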

Hanish_Bansal · September 17, 2018, 5:50pm

@rjernst Is there any way to store multiple values of a document without sorting them? As per our understanding, multiple values are stored as an array type and preserve their order.

rjernst · September 17, 2018, 6:18pm

No, this is not configurable. Lucene does not have any native non-sorted, multi-valued numerics. When you say "multiple values are stored as array type and preserve the order", I think you may be confusing doc values (which is how values are accessed through doc in scripts) with _source, which is the raw input JSON and is kept as is.

Vaibhav_Thapliyal · September 17, 2018, 7:48pm

Hi Ryan,
Is there any other way that I can access this _source field in my script query or is this available only in the update scripts?

Vaibhav_Thapliyal · September 17, 2018, 7:55pm

I have devised a possible workaround for this problem that is working for me.

Instead of storing the values as arrays, I am now storing them as comma-separated text in a keyword field. I have changed the input query param in the same way. Both look something like this:

4.0,96.0,159.0,120.0,234.0,240.0,180.0,200.0,238.0,211.0,160.0,176.0,213.0,254.0,109.0,42.0,135.0,113.0,234.0,19.0,23.0,103.0,174.0,64.0,73.0,131.0,79.0,63.0,196.0,113.0,175.0,130.0,26.0,170.0,142.0,184.0,10.0,209.0,152.0,202.0,134.0,244.0,144.0,171.0,173.0,238.0,226.0,130.0,79.0,251.0,239.0,133.0,14.0,75.0,206.0,49.0,154.0,67.0,63.0,185.0,117.0,140.0,45.0,203.0,114.0,32.0,248.0,100.0,75.0,89.0,208.0,72.0,99.0,248.0,180.0,42.0,19.0,60.0,86.0,50.0,43.0,213.0,252.0,129.0,13.0,99.0,204.0,80.0,155.0,217.0,5.0,183.0,101.0,32.0,78.0,7.0,114.0,34.0,229.0,69.0,67.0,88.0,105.0,32.0,41.0,164.0,178.0,141.0,75.0,39.0,212.0,214.0,34.0,196.0,237.0,140.0,8.0,66.0,229.0,192.0,159.0,250.0,207.0,147.0,39.0,164.0,86.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0

And I have tweaked my query to:

GET images_features/_search
{
  "sort": [
    {
      "_score": {
        "order": "asc"
      }
    }
  ],
  "query": {
    "function_score": {
      "script_score": {
        "script": {
          "lang": "painless",
          "source": "double distance = 0; double diff = 0; String feature_vector = doc['feature_vector'].value; String []feature_vectors = /,/.split(feature_vector); String []input_vectors = /,/.split(params.query_feature); for(int j = 0; j < feature_vectors.length; j++){diff = Math.abs(Double.parseDouble(feature_vectors[j])) - Math.abs(Double.parseDouble(input_vectors[j])); distance = distance + (diff*diff)} return Math.sqrt(distance)",
          "params": {
            "query_feature": "4.0,96.0,222.0,120.0,234.0,240.0,180.0,200.0,238.0,211.0,160.0,176.0,213.0,254.0,109.0,42.0,135.0,113.0,234.0,19.0,23.0,103.0,174.0,64.0,73.0,131.0,79.0,63.0,196.0,113.0,175.0,134.0,26.0,170.0,142.0,184.0,10.0,145.0,152.0,202.0,134.0,244.0,144.0,171.0,173.0,238.0,226.0,130.0,79.0,251.0,239.0,133.0,14.0,75.0,206.0,49.0,154.0,67.0,47.0,185.0,117.0,140.0,45.0,195.0,114.0,32.0,248.0,100.0,75.0,89.0,208.0,72.0,99.0,248.0,180.0,42.0,19.0,28.0,86.0,50.0,43.0,213.0,252.0,129.0,13.0,99.0,204.0,80.0,155.0,217.0,5.0,183.0,101.0,32.0,78.0,7.0,114.0,34.0,229.0,69.0,67.0,88.0,105.0,0.0,41.0,164.0,178.0,141.0,75.0,39.0,212.0,214.0,34.0,196.0,237.0,140.0,8.0,66.0,229.0,192.0,159.0,250.0,207.0,147.0,39.0,164.0,86.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0"
          }
        }
      }
    }
  }
}

This is currently working for me.

rjernst · September 17, 2018, 8:02pm

Sure, that works, but it means you will incur the cost of parsing on every document.
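The Python sketch below mirrors the workaround outside Elasticsearch to show where that cost comes from. The document IDs and the four-value vectors are hypothetical (the values are the first four entries of the stored and query strings above), and the function names are made up for this example.

```python
import math

def parse_vector(text):
    """Split a comma-separated string and convert each part to float."""
    return [float(part) for part in text.split(",")]

def euclidean(a, b):
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank(documents, query_text):
    """Return (distance, doc_id) pairs, nearest first."""
    query = parse_vector(query_text)        # parsed once per request
    scored = []
    for doc_id, stored_text in documents.items():
        stored = parse_vector(stored_text)  # parsed again for every document
        scored.append((euclidean(stored, query), doc_id))
    return sorted(scored)

# Hypothetical documents holding shortened vectors.
documents = {
    "a": "4.0,96.0,159.0,120.0",
    "b": "4.0,96.0,222.0,120.0",
}
ranked = rank(documents, "4.0,96.0,222.0,120.0")
print(ranked)
```

In this sketch the query string is parsed once, whereas the Painless script in the workaround also re-splits the query parameter for each document. Either way, the stored string has to be split and converted for every document that is scored, so the work grows with both the vector length and the number of matching documents.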

