How to perform math formula in Elasticsearch script score based on field value conditions?

Mariusz_Liksza · October 18, 2021, 9:36pm

I would like to apply a simple math formula to all documents in Elasticsearch DB.
The tricky thing is that each variables consists of multiply values which are at the same time person attributes.

{
        "_id" : "1",
        
          "A" : [
            1,
            2,
            3,
            4
          ],
          "B" : [
            5,
            6
          ]
}

Here we have UserId = 1 having

attributes 1,2,3,4 from source A
attributes 5,6 from source B

And now, I would like to calculate the following formula for all documents:

5 * 1 (if A has value 2) or 5 * 0 (if A has not value 2) + 4 * 1 (if A has value 3) or 4 * 0 * (if A has not value 3)

In SQL it would be very simple:

SELECT 5 * A_2 + 4 * A_3 FROM table.

As in SQL table each attribute would be a separate column but in Elasticsearch we keep all information about a person (from all tables/sources) in one document.

I think it is feasible using CASE statements but I would prefer to do this in more elegant way, especially as sometimes we need to use 200 or more attributes in one equation and I do not want to have > 200 CASE statements.

I was trying something like this:

GET /index/_search
{
      "query": {
        "function_score": {
          "script_score": {
            "script": {
              "params": {
                "a": 5,
                "b": 4
              },
              "source": "if (2 isin doc['A'].value) { return params.a} else { return 0 } + 
                         if (3 isin doc['A'].value) { return params.b} else { return 0 }"

            }
          }
        }
      }
  }

But, ES can't handle with checking if '2' is an element of array A.

Mariusz_Liksza · October 20, 2021, 9:21pm

I was able to achieve that using below query:

    GET /index/_search
{
    "query": {
        "function_score": {
            "query": {
                "match_all": {}
            },
            "script_score": {
                "script": {
                    "params": {
                        "a": 5,
                        "b": 4
                    },
                    "source": """
              double sum = 0.0; 
              double val = 0;
              for (item in doc['A'
                    ]) {
                try {
                  val = params.get(Long.toString(item))
                        } 
                catch (NullPointerException e){
                  val = 0
                        } if (val != 0) {
                  sum += val
                        }
                    } 
              return sum
                """
                }
            }
        }
    }
}

It works but the performance is not perfect. I get a timeout error after 5 minutes for 250 MM documents. Do you know how to achieve the same thing while reducing elapsed time?

warkolm · October 21, 2021, 2:54am

Welcome to our community!

I'm not going to be able to give you much assistance on the script side of things sorry.
My suggestion would be to move this to being calculated at ingest time, which is a much more efficient approach.

Mariusz_Liksza · October 21, 2021, 6:57am

Hi @warkolm !

Thank you for your suggestion.

Unfortunately, params section differs every time we run above script, values for "a", "b" etc. are dynamic.

Christian_Dahlqvist · October 21, 2021, 7:25am

It looks like all documents in the index will need to be scored using the script, which can be slow. As processing of the query per shard is single-threaded you may improve performance by increasing the number of primary shards to match the number of cores you have. If you however expect to have multiple queries of this type running concurrently that may not be optimal and I would recommend you benchmark using real data and queries.

If you can add a filter to the query to directly filter out poor candidates you could potentially also speed it up. This might e.g. be a filter removing documents that does not have at least one of the parameters provided and therefore would get very poor scores.

Mariusz_Liksza · October 22, 2021, 11:29am

Thank you @Christian_Dahlqvist for your suggestions!

Yes, we expect to have multiply queries like this one at the same time but even though multi-threading is worth to try.

Excluding poor candidates sounds reasonable but we loose some good candidates that way. Let's say a candidate have 29 from 30 the best variables. We would like to still have this person included. Probably something like: let's filter out all people that have less than 80% best 30 variables would work but how to do that in an efficient way...

Christian_Dahlqvist · October 22, 2021, 12:10pm

That is something you need to know the data to determine. If you do not filter some way you do not benefit from the indexing that makes Elasticsearch fadt and you likely have a query that does scale badly.

system · November 19, 2021, 12:10pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Trying to understand Painless Elasticsearch	3	463	November 7, 2017
Custom return fields with native scripts Elasticsearch	1	378	July 6, 2017
Math calculation on different fields? How to? Elasticsearch	4	1017	June 22, 2018
Is it possible to perform calculations using multiple fields in a document Elasticsearch	4	16675	July 5, 2017
Performance issue with script scoring with fields having a large array Elasticsearch	2	1612	July 6, 2017

How to perform math formula in Elasticsearch script score based on field value conditions?

Related topics