Log reg scoring with nested fields, use case for multiply score in should clauses?

nihaux · June 5, 2021, 12:08pm

Hello everyone, this might be a semi-long post.

My use case is the following.

We have a stack that uses a logistic regression on different features in Python to rank documents.

I am POCing a version on ES so that we can use full-text search before applying the log reg.

The simplified properties of my index resemble something like this:

"properties": {
    "searchable_objects_belonging_to_main_doc": {
        "type": "nested",
        "properties": {
            "searchable_property": {"type": "text"},
            ...
            "property_use_to_make_calculation": {"type": "double"}
        }
    },
    "other_property": {"type": "double"},
    ...
}

My first approach was to use a nested query, then a rescore with a Painless script to apply the log reg to the docs returned by the nested query.
=> Problem: I need some information about the nested objects that matched the nested query in order to apply the log reg. From what I read, getting information about nested matches in a rescore is not possible because nested objects are separate Lucene docs.

The second approach was to rethink the mapping. Each nested object would become its own document containing the main document it belongs to, so nested objects would no longer be needed.
=> Problem: I would need to use collapse before rescore so that the window of documents sent to rescore contains each main document only once. And from what I read, it is not possible to combine collapse and rescore (an explicit exception is raised).

Third approach:
Our log reg is of the form:

x = feat1 + feat2 + feat3 ... + featN
return 1/(1 + Math.exp(x))

which could translate to
(1/(1 + Math.exp(feat1))) * (1/(1 + Math.exp(feat2))) * ... * (1/(1 + Math.exp(featN)))
So, given that feat1 is based on the values contained in the nested documents matching the nested query, I could put a script score inside the nested query and retrieve its score on the top-level document! => This works.
BUT now I need to be able to multiply the retrieved value with other values. Something like:

"should": [
    {"nested": ... => returns the value of (1/(1 + Math.exp(feat1))) for the matching nested docs},
    { this query would calculate (1/(1 + Math.exp(feat2))) },
    { ... },
    { this query would calculate (1/(1 + Math.exp(featN))) }
]

But for that to work, I would need the should clause to multiply the scores, and from what I read on this forum this has not been implemented due to a lack of use cases.
Maybe this one is valid? Or maybe I am going in the wrong direction, in which case I will gladly take any pointers.
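
A quick numerical check of the factorization above may be worth doing before building on it. The sketch below uses made-up feature values (an assumption, not data from our stack) and suggests that the product of the per-feature terms is generally not equal to 1/(1 + Math.exp(feat1 + ... + featN)). What does factor is the exponential itself, since exp(a + b) = exp(a) * exp(b), so a multiplying combination would need each clause to return exp(feat_i), with 1/(1 + product) applied once at the end.

```python
import math


def log_reg(x):
    # Same form as in the post: 1 / (1 + exp(x))
    return 1.0 / (1.0 + math.exp(x))


# Hypothetical feature values, chosen only for illustration.
feats = [0.3, -1.2, 0.8]

# Score computed the original way: one log reg over the summed features.
score_of_sum = log_reg(sum(feats))

# Product of one log reg per feature (the proposed translation).
product_of_scores = math.prod(log_reg(f) for f in feats)

# exp of a sum factorizes into a product of exps, so this variant
# multiplies exp(feat_i) and applies 1 / (1 + p) a single time.
product_of_exps = math.prod(math.exp(f) for f in feats)
score_from_product = 1.0 / (1.0 + product_of_exps)

print("log reg of the sum:      ", score_of_sum)
print("product of log regs:     ", product_of_scores)
print("1 / (1 + product of exp):", score_from_product)
```

If that holds for the real features, the multiplication only has to combine the exp(feat_i) terms, and the final 1/(1 + ...) step can live in whichever script runs last.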

mayya · June 10, 2021, 9:26pm

I am not super clear about your problem statement. For example, is feat1 based on scores of top-level documents, while feat2 is based on scores of nested docs?

One thing I can recommend is to explore the include_in_parent param for nested documents. It gives you access to nested documents' fields through top-level documents.
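
As a rough sketch of what that could look like, here is the simplified mapping from the first post with include_in_parent switched on, written as a Python dict so it can be serialized and sent with whichever client is in use. The field names are the placeholders from that post, and the elided properties ("...") are simply left out.

```python
import json

# Sketch only: the nested field also gets its sub-fields indexed as flat
# fields on the parent document when include_in_parent is true.
mapping = {
    "properties": {
        "searchable_objects_belonging_to_main_doc": {
            "type": "nested",
            "include_in_parent": True,
            "properties": {
                "searchable_property": {"type": "text"},
                "property_use_to_make_calculation": {"type": "double"},
            },
        },
        "other_property": {"type": "double"},
    }
}

# Serialized form, e.g. for the body of a create-index or put-mapping request.
print(json.dumps({"mappings": mapping}, indent=2))
```

Keep in mind that the flattened copy on the parent holds the values of all nested objects together, without the per-object grouping.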

nihaux · June 13, 2021, 5:43pm

Hello Mayya, and thx for taking the time.

To go back to my example, feat1 is calculated based on values contained in the nested documents that match a particular query.
For instance, given this simplified example:

MainDoc: {
  nestedDocs: [
    { title: "foo", value: "10" },
    { title: "foo", value: "2" },
    { title: "bar", value: "1" },
    { title: "bar", value: "1" }
  ],
  mainDocValue: 0.5
}

I would like to sum the values of the nestedDocs that match "title" == "foo", then multiply the result by mainDocValue.
I can get the sum of the values of the nestedDocs that match "title" == "foo" by using a nested query + script score,
something like:

"nested": {
  "path": "nestedDocs",
  "query": {
    "script_score": {
      "query": {"match": { "nestedDocs.title": "foo" }},
      "script": {"source": "doc['nestedDocs.value'].value"}
    }
  },
  "score_mode": "sum"
}

Now all I need to do is multiply the result of this query by mainDocValue, and this is where I'm stuck.
Using a bool query, I can sum different queries but not multiply them:

{
  "bool": {
    "should": [
      {...the nested query described up there},
      {...query that returns mainDocValue }
    ],
    "score_mode": "multiply" <= this would be great :)
  }
}

From what I understand, include_in_parent does not solve my problem, as I need to make the calculation on a subset of the nested documents that depends on a query against them.
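
To make the target number concrete, here is the same calculation in plain Python on the example document. It is only a sketch of the expected result, not of how ES computes it: it assumes value is stored as a number and uses an exact title comparison in place of the match query.

```python
# The example document, with "value" as numbers (an assumption; the example
# above writes them as strings).
main_doc = {
    "nestedDocs": [
        {"title": "foo", "value": 10.0},
        {"title": "foo", "value": 2.0},
        {"title": "bar", "value": 1.0},
        {"title": "bar", "value": 1.0},
    ],
    "mainDocValue": 0.5,
}


def expected_score(doc, title):
    """Sum the values of the nested docs with the given title, then
    multiply that sum by the top-level mainDocValue."""
    nested_sum = sum(
        nested["value"] for nested in doc["nestedDocs"] if nested["title"] == title
    )
    return nested_sum * doc["mainDocValue"]


print(expected_score(main_doc, "foo"))  # (10 + 2) * 0.5 = 6.0
```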

mayya · June 14, 2021, 9:25pm

Thanks for providing more details, the issue is clear now.
One way you can achieve this, which is a little bit bulky, is to put your nested query under another script_score query that you run on your main docs, like this:

{
  "query": {
    "script_score": {
      "query": {
        "nested": {
          "path": "nestedDocs",
          "query": {
            "script_score": {
              "query": {
                "match": {
                  "nestedDocs.title": "foo"
                }
              },
              "script": {
                "source": "doc['nestedDocs.value'].value"
              }
            }
          },
          "score_mode": "sum"
        }
      },
      "script": {
        "source": "_score * doc['mainDocValue'].value"
      }
    }
  }
}
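
If the request is built from Python, the same body could be assembled with a small helper along these lines. This is a sketch: the function name and its parameters are hypothetical, and the resulting dict would be passed as the search body to whichever client is in use.

```python
import json


def nested_sum_times_field(path, match_field, match_text, value_field, parent_field):
    """Build a request body that sums a nested numeric field over the nested
    docs matching a query, then multiplies that sum by a parent-level field."""
    # Inner script_score: each matching nested doc scores as its own value.
    inner = {
        "script_score": {
            "query": {"match": {match_field: match_text}},
            "script": {"source": f"doc['{value_field}'].value"},
        }
    }
    # score_mode "sum" adds up the nested scores on the parent document.
    nested = {"nested": {"path": path, "query": inner, "score_mode": "sum"}}
    # Outer script_score: _score is the nested sum for the parent document.
    return {
        "query": {
            "script_score": {
                "query": nested,
                "script": {"source": f"_score * doc['{parent_field}'].value"},
            }
        }
    }


body = nested_sum_times_field(
    path="nestedDocs",
    match_field="nestedDocs.title",
    match_text="foo",
    value_field="nestedDocs.value",
    parent_field="mainDocValue",
)
print(json.dumps(body, indent=2))
```

For the example document above, the nested sum is 12 and mainDocValue is 0.5, so the final score should come out as 6. Further top-level features can be folded into the outer script in the same way.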
