Context:
We've implemented a learning-to-rank re-scorer for Elasticsearch hits, but (unfortunately) it has to live outside of Elasticsearch, i.e. as a microservice that takes candidates from Elasticsearch, re-scores each hit with a statistical model, and re-ranks those candidates by the model's scores before returning them to the user.
As features, the model uses fields from the document hit, from the query, and the _score, which sort of captures the combination of these.
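To make the setup concrete, here is a minimal sketch of the re-ranking step, assuming hits arrive in the usual Elasticsearch shape (`_source`, `_score`). The names `build_features`, `rerank` and `model_score` are hypothetical stand-ins, not the real service code, and the model itself is just a callable here.

```python
from typing import Any, Callable, Dict, List

Hit = Dict[str, Any]


def build_features(hit: Hit, query: str) -> Dict[str, Any]:
    """Combine document fields, the query text, and Elasticsearch's _score."""
    features = dict(hit.get("_source", {}))
    features["query"] = query
    # _score can be null (e.g. when sorting on a field), so fall back to 0.0.
    features["es_score"] = hit.get("_score") or 0.0
    return features


def rerank(
    hits: List[Hit],
    query: str,
    model_score: Callable[[Dict[str, Any]], float],
) -> List[Hit]:
    """Re-score every candidate with the model and return them best-first."""
    scored = [(model_score(build_features(hit, query)), hit) for hit in hits]
    # Sort on the model score only; ties keep Elasticsearch's original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [hit for _, hit in scored]
```

Any per-field similarity values that Elasticsearch could return would simply become extra entries in the feature dict.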
Goal:
I'd like to extend the statistical model to include the similarity between the query and each of a handful of individual document fields, as separate features. For example, I'd like Elasticsearch to include, in the results of a /_search, some calculated fields such as all of these:
- _score (I use this already, but it's pretty coarse resolution)
- BM25 between query and document title
- BM25 between query and document url
- BM25 between query and document tags
and so on. I'm not picky: if I can't get BM25 exactly, some approximate value that roughly correlates is fine.
Ideally I need to get Elasticsearch to calculate these values and simply return them to me in the response payload. I don't need it to change how Elasticsearch internally computes the _score.
What I've tried
I know that ?explain=true interleaves these calculations directly into my hits, which in theory is all I need. The downside is that it bloats the payload at least 50x (it's sending far more than the few pieces of data I do need), and also it's a pain to programmatically parse out what I need from the explain report.
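For reference, the parsing I have in mind looks roughly like the sketch below. It assumes each hit's `_explanation` is the usual tree of `value` / `description` / `details` nodes, and that per-term nodes carry descriptions of the form `weight(title:purple in 0) [PerFieldSimilarity], result of:`. That wording is not a documented contract as far as I know, so the regex is an assumption that may break across versions, and `per_field_scores` is a hypothetical helper name.

```python
import re
from collections import defaultdict
from typing import Any, Dict

# Assumed description format: "weight(<field>:<term> in <doc>) [PerFieldSimilarity], result of:"
_WEIGHT_RE = re.compile(r"^weight\((?P<field>[^:()\s]+):")


def per_field_scores(explanation: Dict[str, Any]) -> Dict[str, float]:
    """Sum the per-term weight nodes of an explain tree, grouped by field."""
    totals: Dict[str, float] = defaultdict(float)
    stack = [explanation]
    while stack:
        node = stack.pop()
        match = _WEIGHT_RE.match(node.get("description", ""))
        if match:
            totals[match.group("field")] += float(node.get("value", 0.0))
            # Do not descend further: the children are factors of this weight
            # (idf, tfNorm, ...), not additional per-term scores.
            continue
        stack.extend(node.get("details") or [])
    return dict(totals)
```

Calling `per_field_scores(hit["_explanation"])` on each hit would give one number per field that had a matching term, but it still requires shipping the whole explain payload over the wire.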
I think it would be more useful to specify script_fields, one for each of the per-field BM25s I want back. Pseudocode:
GET /_search
{
  "query": {
    "query_string": {
      "query": "purple island",
      "fields": ["title", "url", "description", "tags"]
    }
  },
  "script_fields": {
    "sim_to_title": {
      "script": {
        "source": "doc['title'].PerFieldSimilarity"
      }
    },
    "sim_to_url": {
      "script": {
        "source": "doc['url'].PerFieldSimilarity"
      }
    }
  }
}
but I'm at a loss as to what these scripts actually need to look like. I found https://www.elastic.co/guide/en/elasticsearch/painless/6.5/painless-similarity-context.html but I'm not fully sure how to use it; I get the feeling that these scripts only work at index time to define a custom similarity metric, whereas I need to return all these values at search time.
What ideas does this community have?
Thanks very much! --Jeff