Return per-field similarity for each of many fields

Context:
We've implemented a learning-to-rank re-scorer for Elasticsearch hits, but (unfortunately) it has to live outside of Elasticsearch, i.e. as a microservice that takes candidates from Elasticsearch, re-scores each hit with a statistical model, and re-ranks those candidates by the model's scores before returning them to the user.

As features, the model uses fields from the document hit, from the query, and the _score, which sort of captures the combination of these.

Goal:
I'd like to extend the statistical model to include the similarity between the query and each of a handful of individual document fields, as separate features. For example, I'd like the /_search response to include calculated fields such as these:

  • _score (I use this already, but it's pretty coarse resolution)
  • BM25 between query and document title
  • BM25 between query and document url
  • BM25 between query and document tags

and so on. I'm not picky: if I can't get BM25 exactly, an approximate value that roughly correlates is fine.

Ideally I need to get Elasticsearch to calculate these values and simply return them to me in the response payload. I don't need it to change how Elasticsearch internally computes the _score.

What I've tried

I know that ?explain=true interleaves these calculations directly into my hits, which in theory is all I need. The downside is that it bloats the payload at least 50x (it's sending far more than the few pieces of data I do need), and also it's a pain to programmatically parse out what I need from the explain report.
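For reference, the parsing step can be sketched in Python roughly as follows. This is only a sketch under assumptions: it relies on each hit's _explanation being a tree of nodes with "value", "description" and "details" keys, and on per-term nodes having descriptions that start with "weight(<field>:<term> ...", which is what I see in my output but is not a documented, stable format. The function name per_field_scores and the sample tree are made up for illustration; the sample is hand-written, not real Elasticsearch output.

```python
import re
from collections import defaultdict

# Assumed (not guaranteed) description format of per-term score nodes, e.g.
# "weight(title:purple in 0) [PerFieldSimilarity], result of:"
_WEIGHT = re.compile(r"^weight\((?P<field>[^:]+):")


def per_field_scores(explanation):
    """Sum the per-term weight nodes of an explain tree, grouped by field."""
    totals = defaultdict(float)
    stack = [explanation]
    while stack:
        node = stack.pop()
        match = _WEIGHT.match(node.get("description", ""))
        if match:
            totals[match.group("field")] += node["value"]
            continue  # children are only components of this weight
        stack.extend(node.get("details", []))
    return dict(totals)


# Hand-written sample tree (illustrative only).
sample = {
    "value": 2.0,
    "description": "max of:",
    "details": [
        {
            "value": 2.0,
            "description": "sum of:",
            "details": [
                {"value": 1.5, "description": "weight(title:purple in 0) [PerFieldSimilarity], result of:", "details": []},
                {"value": 0.5, "description": "weight(title:island in 0) [PerFieldSimilarity], result of:", "details": []},
            ],
        },
        {"value": 0.25, "description": "weight(url:purple in 0) [PerFieldSimilarity], result of:", "details": []},
    ],
}

print(per_field_scores(sample))
```

Summing the weight nodes ignores how the query combines fields (for example "max of"), which is fine for my purposes since I only want a per-field feature. It does nothing about the payload size, though.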

I think it would be more useful to specify script_fields, one for each of the per-field BM25s I want back. Pseudocode:

GET /_search
{
    "query" : {
        "query_string": {
            "query": "purple island",
            "fields": ["title", "url", "description", "tags"]
        }
    },
    "script_fields" : {
        "sim_to_title" : {
            "script" : {
                "source": "doc['title'].PerFieldSimilarity"
            }
        },
        "sim_to_url" : {
            "script" : {
                "source": "doc['url'].PerFieldSimilarity"
            }
        }
    }
}

but I'm at a loss as to what these scripts actually need to look like. I found https://www.elastic.co/guide/en/elasticsearch/painless/6.5/painless-similarity-context.html but I'm not fully sure how to use it; I get the feeling these scripts only work at index time to define a custom similarity metric, whereas I need to return all these values at search time.

What ideas does this community have?

Thanks very much! --Jeff

There is no way to produce scores through script_fields. Painless has different contexts, and in the field context we only have access to the values of the document's fields.

Moreover, each search request will produce scores in a single way. The only way to achieve what you are trying to do is to issue 4 different search requests, 3 of which will be multi match queries.
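As a rough illustration of the separate-requests approach, the per-field requests could be batched with the _msearch API, which takes newline-delimited JSON (a header line followed by a body line for each search). The sketch below only builds that request body and sends nothing. Several things in it are assumptions rather than part of the advice above: the helper name build_msearch_body, the index name "docs", the use of a single-field match query per request, and the ids filter that restricts each request to candidates already retrieved by the main query.

```python
import json


def build_msearch_body(index, query_text, fields, doc_ids):
    """Build an NDJSON _msearch body with one single-field query per field."""
    lines = []
    for field in fields:
        header = {"index": index}
        body = {
            "_source": False,       # only _id and _score are needed
            "size": len(doc_ids),   # at most one hit per candidate
            "query": {
                "bool": {
                    # The must clause alone determines _score for this field.
                    "must": [{"match": {field: query_text}}],
                    # The filter limits hits to known candidates, no score impact.
                    "filter": [{"ids": {"values": doc_ids}}],
                }
            },
        }
        lines.append(json.dumps(header))
        lines.append(json.dumps(body))
    # _msearch requires the body to end with a newline.
    return "\n".join(lines) + "\n"


body = build_msearch_body("docs", "purple island", ["title", "url", "tags"], ["1", "2"])
print(body)
```

The responses come back in the same order as the requests, so the i-th response holds the scores for the i-th field. Candidates that do not match a given field are simply absent from that response and can be treated as having a score of 0 for that feature.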

We are also currently working on improving scoring across multiple fields in a way that can also take document statistics into account, but this is a work in progress.

Thanks for this information! In particular it was helpful to know that script_fields happens in a Painless Field context; I'd seen those contexts listed but the documentation is missing the mapping between context and other ES syntax such as script_fields.

Regarding the "improving scoring across multiple fields" issue... I don't think it adds much to my use case, at least in the sense that our machine learning model re-ranks results according to specific combinations of fields, a method that goes beyond the heuristics of BM25/BM25F.
