Hello,
I'm investigating whether ES can order documents by a relevance score for two or more keyword sets, assigning a high score only to documents that are strongly related to all of the keyword sets.
Let's say I have indexed articles from a news site. I have keywords for two topics:
I want to find articles where election candidates talk about sport. I can form a query:
(election OR debate OR president OR democrats) AND (football OR cycling OR olympics OR race)
I can add minimum_should_match: 1 to ensure the returned documents will have matches from both topics. But this doesn't prevent the problem, which is that the ordering of the search results will be poor. Since my news site has many articles about elections, the top results will talk almost solely about elections. The requirement to match the "sport" topic can be fulfilled by matching a minor keyword or a keyword with varying semantics (like "race").
To achieve the wanted ordering I could run the following:
execute a query for the single topic "election"
execute a query for the single topic "sport"
combine the scores from the two above queries: assign a high score only if the scores for both topics are high (e.g. I could use a harmonic mean or a minimum)
But of course this solution is not feasible for large datasets. So the question is whether Elasticsearch provides any query types to handle this.
Extra information:
I can have hundreds of keywords for each topic, each keyword with its own boost, and the differences between boosts are large, e.g. the boost can be 100 for the top keyword and 1 for a minor keyword. So relying on minimum_should_match or filtering in function_score won't be reliable.
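For reference, the two-topic query described above could be written in the query DSL roughly like this. It is only a sketch: the field name "body", the helper names and the sample boosts are made up for illustration, and it reproduces the problematic ranking rather than fixing it.

```python
# Sketch of the "(topic A) AND (topic B)" query with per-keyword boosts.
# Assumptions: articles are indexed in a text field called "body";
# topic_clause and two_topic_query are hypothetical helper names.

def topic_clause(boosts, field="body"):
    """One topic: OR over its keywords, each keyword with its own boost."""
    return {
        "bool": {
            "should": [
                {"match": {field: {"query": keyword, "boost": boost}}}
                for keyword, boost in boosts.items()
            ],
            # at least one keyword of this topic has to match
            "minimum_should_match": 1,
        }
    }

def two_topic_query(election_boosts, sport_boosts, field="body"):
    """Both topics are required, so they go into "must"."""
    return {
        "query": {
            "bool": {
                "must": [
                    topic_clause(election_boosts, field),
                    topic_clause(sport_boosts, field),
                ]
            }
        }
    }

body = two_topic_query(
    {"election": 100, "debate": 10, "president": 5, "democrats": 1},
    {"football": 100, "cycling": 10, "olympics": 5, "race": 1},
)
```

The resulting dict can be sent as the body of a search request. The score of each document is still the sum of both topic clauses, which is why an article that is strong on elections and only mentions "race" once can rank at the top.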
I don't have a solution for you, as ranking documents is a difficult topic that needs a lot of experimentation.
combine the scores from the two above queries: assign a high score only if the scores for both topics are high
Your scoring formula doesn't seem to be clear, e.g. what is a "high" score? How can the scores be combined?
I can only suggest some elasticsearch tools that you can explore:
the dis_max query allows boosting the scores of the 1st query by the scores of the 2nd query
rescoring allows rescoring the top hits from the 1st query based on an additional 2nd query
the rank_features field type allows indexing pre-calculated topics for each document. Then, using the rank_feature query, you can rank documents based on how well they reflect the topics
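To make the rescoring option concrete, a request body could look roughly like the sketch below. The helper name, the window size and the choice of "multiply" as score_mode are assumptions made for illustration; the two topic queries are passed in as ready-made query clauses.

```python
# Sketch of a search body that ranks by one topic and rescores the top hits
# with the other topic. rescore_request is a hypothetical helper name.

def rescore_request(first_topic_query, second_topic_query, window_size=500):
    return {
        # 1st query: selects and ranks the initial hits
        "query": first_topic_query,
        "rescore": {
            # only the top window_size hits per shard are rescored
            "window_size": window_size,
            "query": {
                # 2nd query: its score is combined with the original score
                "rescore_query": second_topic_query,
                "query_weight": 1.0,
                "rescore_query_weight": 1.0,
                # assumption: multiply, so both scores have to be high;
                # other modes such as "total" or "min" can be tried as well
                "score_mode": "multiply",
            },
        },
    }

body = rescore_request(
    {"match": {"body": "election debate president democrats"}},
    {"match": {"body": "football cycling olympics race"}},
)
```

Only the hits inside the window are reordered, so documents that rank low for the 1st query never reach the rescoring step.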
As an experiment I did what I described above and got good results, i.e.:
read all results of the query for the topic election, normalized the relevancy scores by dividing each score by the max score (so all the relevancy scores were in the range 0.0-1.0)
did the same for the sport topic
ordered all results from both queries by using the formula: score[id] = harmonic_mean(election_scores[id], sport_scores[id])
This works well because harmonic_mean is large only if both arguments are large (i.e. the document is relevant to both topics at once). E.g. harmonic_mean(0.9, 0.8) ≈ 0.85, and harmonic_mean(0.9, 0.1) == 0.18. Multiplication or minimum would probably also work.
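The experiment above can be sketched on the client side as follows. The function names and the sample scores are made up for illustration, and the sketch assumes the per-topic scores have already been read from the two queries into dicts keyed by document id.

```python
# Client-side sketch: normalize per-topic scores, then combine them with
# the harmonic mean. All names and sample values here are illustrative.

def normalize(scores):
    """Divide every score by the max score, giving values in 0.0-1.0."""
    if not scores:
        return {}
    top = max(scores.values())
    if top <= 0:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: score / top for doc_id, score in scores.items()}

def harmonic_mean(a, b):
    """Large only if both a and b are large."""
    if a <= 0 or b <= 0:
        return 0.0
    return 2 * a * b / (a + b)

def combine(election_scores, sport_scores):
    """Rank the documents returned by both queries, best first."""
    election = normalize(election_scores)
    sport = normalize(sport_scores)
    # a document missing from either result set is dropped
    common = election.keys() & sport.keys()
    combined = {
        doc_id: harmonic_mean(election[doc_id], sport[doc_id])
        for doc_id in common
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

ranking = combine(
    {"a": 10.0, "b": 9.0, "c": 2.0},
    {"a": 1.0, "b": 8.0, "c": 10.0, "d": 3.0},
)
# "b" comes first: it is the only document scoring high on both topics
```

Replacing harmonic_mean with min or with a plain product only changes one line, which makes it easy to compare the three variants.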
I don't think this query could prevent the main problem, i.e. the top matches being dominated by documents relevant only to a single topic - max doesn't care about the second topic. And "tie breaking increments" don't look sufficient for my case, i.e. in most cases these increments won't be large enough to give a score greater than what documents relevant to a single topic only have.
If there were a dis_min query, it should work.
This allows only "reordering just the top (eg 100 - 500) documents", so it won't be enough: in my index, the documents relevant to both topics at once often don't score very high for the individual topics.
Unfortunately my topics (keyword sets) are defined dynamically so it's not possible to have a list of all possible keyword sets at indexing time.
On the other hand, BM25, which Elasticsearch uses, is a well-researched information retrieval model. In your case I would start with a standard match query that combines all the terms from the 2 topics into a single query, or with a bool query that has 2 separate should clauses, each containing a match query for one topic.