Hello there,
Right now I'm struggling to find out how to obtain relevant results for the two parts of my query. My documents are several million text documents with a "content" field. What would be an appropriate way to retrieve results where
- some entity is deemed relevant, e.g. Germany
- some topic is deemed relevant, e.g. BMW, Daimler, Volkswagen
I.e., ideally I would obtain a separate score for each part of the query so I can decide how to weight the two scores after querying, e.g.
1. id: 1; score_entity: 17.23; score_topic: 5.34
2. id: 7; score_entity: 12.01; score_topic: 24.02
...
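To make the intended post-processing concrete, here is a minimal Python sketch of what I would do with the two scores once I have them. It assumes the scores arrive as two plain {id: score} mappings (for example from two separate searches); the function name combine_scores and the weights w_entity and w_topic are made up for illustration and are not part of any Elasticsearch API.

```python
def combine_scores(entity_hits, topic_hits, w_entity=1.0, w_topic=1.0):
    """Merge two {doc_id: score} mappings into per-document rows.

    Documents missing from one mapping get a score of 0.0 for that part.
    """
    rows = []
    for doc_id in set(entity_hits) | set(topic_hits):
        score_entity = entity_hits.get(doc_id, 0.0)
        score_topic = topic_hits.get(doc_id, 0.0)
        rows.append({
            "id": doc_id,
            "score_entity": score_entity,
            "score_topic": score_topic,
            # Hypothetical weighting: a plain weighted sum of both parts.
            "combined": w_entity * score_entity + w_topic * score_topic,
        })
    # Highest combined score first; ties are broken by id for a stable order.
    rows.sort(key=lambda r: (-r["combined"], str(r["id"])))
    return rows


# The two example documents from above.
entity_hits = {1: 17.23, 7: 12.01}
topic_hits = {1: 5.34, 7: 24.02}
ranked = combine_scores(entity_hits, topic_hits, w_entity=1.0, w_topic=0.5)
```

The point is that the weights could be changed after querying without running the search again.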
Right now I'm combining both parts in a single bool query:
GET index/_search
{
  "query": {
    "bool": {
      "must": {
        "match_phrase": {
          "content": {
            "query": "Germany"
          }
        }
      },
      "should": [{
        "match_phrase": {
          "content": {
            "query": "BMW"
          }
        }
      }, {
        "match_phrase": {
          "content": {
            "query": "Daimler"
          }
        }
      }, {
        "match_phrase": {
          "content": {
            "query": "Volkswagen"
          }
        }
      }],
      "minimum_should_match": 0
    }
  }
}
This gives me a single overall score. High-scoring documents are probably relevant to both parts, but I cannot be sure about it. Documents that score about average might be relevant to either part alone. I intentionally set minimum_should_match to 0 to include results that are relevant to the country but not the topic.
The best approach I've come up with so far is to query for the country first, select all relevant documents, and then insert their IDs into a separate query for the topic. I guess this would work in principle, but I'm not sure whether the engine accepts several thousand document IDs as query input. Besides, there is another issue with this solution:
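A rough Python sketch of the second step of that idea, building the follow-up request body from the IDs returned by the country search. It uses the ids query inside a bool filter; the helper names build_topic_query and chunked are my own, and splitting the IDs into batches is only an assumption about how one might keep each request small, not something I have verified to be necessary.

```python
def build_topic_query(doc_ids, topics, field="content"):
    """Build a search body that scores only the given documents by topic."""
    return {
        "query": {
            "bool": {
                # Restrict to the documents found by the country search.
                # Filter context does not contribute to the score.
                "filter": {"ids": {"values": [str(i) for i in doc_ids]}},
                # One phrase clause per topic term, as in my current query.
                "should": [
                    {"match_phrase": {field: {"query": t}}} for t in topics
                ],
                "minimum_should_match": 0,
            }
        }
    }


def chunked(items, size):
    """Yield consecutive slices of at most `size` items (hypothetical batching)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


country_ids = [1, 7, 42]  # placeholder IDs from the first (country) search
bodies = [
    build_topic_query(batch, ["BMW", "Daimler", "Volkswagen"])
    for batch in chunked(country_ids, 2)
]
```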
Ideally I want to get comparable results for different countries, e.g. to set a common cutoff score for all countries, be it Germany or San Marino. By "comparable" I mean that the score should depend only on the ratio of the number of occurrences of the country to the text length. To this end, I would basically have to disable several parts of the scoring logic, most notably the idf part, so that the outcome would be "tf / field length". How do I write such a query? I've read about the similarity module, and there's even a relevant example, but I was wondering whether I can change the behavior at query time. Ideally, the country search would be scored this way while the topic search still uses BM25.
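To be explicit about the score I have in mind for the country part, here is a tiny Python illustration. It is only a sketch of the arithmetic: it assumes lowercased whitespace tokenization and single-word terms, which is not how the content field is actually analyzed, and the function name tf_over_length is made up.

```python
def tf_over_length(text, term):
    """Return occurrences of `term` divided by the number of tokens in `text`.

    No idf and no saturation: the value depends only on the document itself,
    so it is the same regardless of how rare the term is in the index.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return tokens.count(term.lower()) / len(tokens)


doc = "Germany exports cars and Germany imports gas"
score = tf_over_length(doc, "Germany")  # 2 occurrences in 7 tokens
```

With a score like this, a cutoff such as "at least 1 occurrence per 100 tokens" would mean the same thing for Germany as for any other country.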
I'm using ES 6.3.2 right now.
I'm grateful for all suggestions and ideas. Thank you in advance!
Best wishes
Henning