Get scores for two parts of bool query

zwirni · October 14, 2019, 1:50pm

Hello there,

Right now I'm struggling to find out how to obtain relevant results for two parts of my query. My documents are basically several million text documents with a "content" field. What would be an appropriate way to receive results where

- some entity is deemed relevant, e.g. Germany
- some topic is deemed relevant, e.g. BMW, Daimler, Volkswagen

Ideally, I would obtain a separate score for each of the two parts so that I can decide how to weight them after querying, e.g.

1. id: 1; score_entity: 17.23; score_topic: 5.34
2. id: 7; score_entity: 12.01; score_topic: 24.02
...
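
To make the target shape concrete, here is a minimal Python sketch of the client-side step, assuming the two parts are run as two separate searches and the `hits.hits` list of each response is available. The function names `merge_scores` and `rank`, the weights, and the choice of 0.0 for a document missing from one result set are illustrative assumptions, not something Elasticsearch provides.

```python
from typing import Dict, List


def merge_scores(entity_hits: List[dict], topic_hits: List[dict]) -> Dict[str, dict]:
    """Join two hit lists on _id, keeping one score per query part."""
    merged: Dict[str, dict] = {}
    for key, hits in (("score_entity", entity_hits), ("score_topic", topic_hits)):
        for hit in hits:
            # A document missing from one result set keeps 0.0 for that part.
            row = merged.setdefault(
                hit["_id"], {"score_entity": 0.0, "score_topic": 0.0}
            )
            row[key] = hit["_score"]
    return merged


def rank(merged: Dict[str, dict], w_entity: float = 1.0, w_topic: float = 1.0) -> List[str]:
    """Order document ids by a weighted sum that is chosen after querying."""
    def combined(doc_id: str) -> float:
        row = merged[doc_id]
        return w_entity * row["score_entity"] + w_topic * row["score_topic"]

    return sorted(merged, key=combined, reverse=True)


# Scores taken from the example listing above.
entity_hits = [{"_id": "1", "_score": 17.23}, {"_id": "7", "_score": 12.01}]
topic_hits = [{"_id": "1", "_score": 5.34}, {"_id": "7", "_score": 24.02}]

merged = merge_scores(entity_hits, topic_hits)
print(rank(merged, w_entity=1.0, w_topic=0.0))  # ['1', '7'] (entity score only)
print(rank(merged, w_entity=1.0, w_topic=1.0))  # ['7', '1'] (equal weights)
```

The point of keeping the two scores apart is that the weighting can be changed without querying again.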

Right now I'm combining both parts in a common bool clause:

GET index/_search
{
  "query": {
    "bool": {
      "must": {
        "match_phrase": {
          "content": {
            "query": "Germany"
          }
        }
      },
      "should": [
        {
          "match_phrase": {
            "content": {
              "query": "BMW"
            }
          }
        },
        {
          "match_phrase": {
            "content": {
              "query": "Daimler"
            }
          }
        },
        {
          "match_phrase": {
            "content": {
              "query": "Volkswagen"
            }
          }
        }
      ],
      "minimum_should_match": 0
    }
  }
}

This gives me a single overall score. High-scoring documents are probably relevant to both parts, but I cannot be sure about it. Documents with an average score might be relevant to only one of the two parts. I intentionally set minimum_should_match to 0 to include results that are relevant to the country but not the topic.

The best thing I've come up with so far is to query for the country first, then select all relevant documents and insert their IDs into a separate query. I guess this would work in principle, but I'm not sure whether the engine accepts several thousand document IDs as query input. Besides, there is another issue with this solution:

Ideally I want to be able to get comparable results for different countries, e.g. to set a common cutoff score for all countries, be it Germany or San Marino. By "comparable" I mean that the score should depend only on the ratio of the number of occurrences of the country to the text length. To this end, I would basically have to disable parts of the scoring logic, most notably the idf part. The outcome would be "tf/field length". How can I accomplish this? I've read about the similarity module and there's even a relevant example, but I was wondering whether I can change the behavior at query time. Ideally, the country search would work like this while the topic search still uses BM25.
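
Coming back to the two-step idea: a rough sketch of how the ID hand-off might be avoided, assuming it is acceptable to send two requests. Clauses in the `filter` part of a `bool` query restrict the hits without contributing to `_score`, so the second request below should return a topic-only score for documents that contain the country. The helper names are hypothetical, and the bodies are plain dicts that would still have to be sent with whatever client is in use.

```python
import json


def phrase(text, field="content"):
    """One match_phrase clause, same shape as in the query above."""
    return {"match_phrase": {field: {"query": text}}}


def entity_request(entity):
    """Request whose _score reflects only the entity phrase."""
    return {"query": phrase(entity)}


def topic_request(entity, topics):
    """Request restricted to the entity but scored only by the topics."""
    return {
        "query": {
            "bool": {
                # Filter context: the phrase must match but adds nothing to _score.
                "filter": [phrase(entity)],
                # Query context: each matching topic phrase adds to _score.
                "should": [phrase(t) for t in topics],
                # Keep documents that mention the entity but none of the topics.
                "minimum_should_match": 0,
            }
        }
    }


entity_body = entity_request("Germany")
topic_body = topic_request("Germany", ["BMW", "Daimler", "Volkswagen"])
print(json.dumps(topic_body, indent=2))
```

With this split, documents that contain the country but none of the topic phrases should come back from the second request with a score of 0, and the two hit lists can be joined on `_id` afterwards.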

I'm using ES 6.3.2 right now.
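
As far as I understand, the similarity is configured per field in the mapping rather than per query, so on 6.x one direction might be an extra sub-field with a scripted similarity while `content` itself keeps the default BM25. The sketch below builds such an index body as a Python dict. The type name `doc`, the sub-field name `tf_len` and the similarity name `tf_over_length` are made up for illustration, and the script is only modelled on the scripted similarity example in the reference docs; I have not verified it against a cluster.

```python
def tf_length_index_body(type_name="doc"):
    """Index body with a sub-field scored as term frequency / field length."""
    script = (
        "double tf = doc.freq; "      # occurrences of the term or phrase in the field
        "double len = doc.length; "   # field length as decoded from the norms
        "return query.boost * tf / len;"  # no idf factor involved
    )
    return {
        "settings": {
            "index": {
                "similarity": {
                    "tf_over_length": {  # hypothetical similarity name
                        "type": "scripted",
                        "script": {"source": script},
                    }
                }
            }
        },
        "mappings": {
            type_name: {  # 6.x mappings still need a type name
                "properties": {
                    "content": {
                        "type": "text",  # default similarity, for the topic part
                        "fields": {
                            "tf_len": {  # hypothetical sub-field for the country part
                                "type": "text",
                                "similarity": "tf_over_length",
                            }
                        },
                    }
                }
            }
        },
    }


index_body = tf_length_index_body()
print(index_body["settings"]["index"]["similarity"]["tf_over_length"]["script"]["source"])
```

The country phrase would then be searched on `content.tf_len` and the topics on `content`. Two caveats under these assumptions: existing documents would have to be reindexed to populate the sub-field, and the length stored in the norms is approximate for long fields, so the ratio is not exact.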

I'm grateful for all suggestions and ideas. Thank you in advance!

Best wishes
Henning

system · November 11, 2019, 1:50pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
