Does boosting individual fields in a cross_fields query work the way one would intuitively expect?
The following page seems to say that boosting is impractical with a cross_fields query:
Note that cross_fields is usually only useful on short string fields that all have a boost of 1. Otherwise boosts, term freqs and length normalization contribute to the score in such a way that the blending of term statistics is not meaningful anymore.
While the following page presents boosting as an advantage over "one big field":
Individual fields can be boosted, which can’t be done with the copy_to field.
Cross fields is designed to select "the right" field for each word.
So given a search for "mark" in both the forename and surname fields it will prefer the more-common forename field over the surname.
It does this by tweaking the reported DF values for surname:mark so that it is one doc more than the most-likely field of forename. Incrementing DF by one subtly reduces IDF to make surname:mark rank lower than forename:mark. The other words in the query, e.g. "harwood", will still be rewarded more highly than mark because they are rarer words.
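To make the arithmetic concrete, here is a rough Python sketch of the adjustment described above. It is not the actual Elasticsearch code: it assumes the index uses BM25-style IDF, and the document counts and the `blended_doc_freqs` helper are invented for illustration.

```python
import math


def bm25_idf(doc_freq, doc_count):
    """BM25-style IDF: the fewer docs contain a term, the higher the value."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def blended_doc_freqs(field_doc_freqs):
    """Toy version of the DF tweak described in this thread (hypothetical helper).

    The field where the term is most common keeps its DF; every other
    field reports that DF plus one, so its IDF ends up slightly lower.
    """
    likely_field = max(field_doc_freqs, key=field_doc_freqs.get)
    top_df = field_doc_freqs[likely_field]
    return {
        field: top_df if field == likely_field else top_df + 1
        for field in field_doc_freqs
    }


DOC_COUNT = 10_000  # invented index size

# Invented raw document frequencies per field for each search term.
mark = blended_doc_freqs({"forename": 500, "surname": 20})
harwood = blended_doc_freqs({"forename": 1, "surname": 30})

idf_mark = {field: bm25_idf(df, DOC_COUNT) for field, df in mark.items()}
idf_harwood = {field: bm25_idf(df, DOC_COUNT) for field, df in harwood.items()}

print("mark   ", idf_mark)
print("harwood", idf_harwood)
```

With these made-up numbers, forename:mark gets a marginally higher IDF than surname:mark, and "harwood" outscores "mark" in either field because it is the rarer word.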
All the subtlety of this per-word score-tweaking is undone if you just slap a global boost on a field.
Boosts are allowed in order to support the very rare cases where you just want to make use of cross_fields' down-playing of rare-but-wrong fields. The original discussion is here.
Personally speaking I was all for throwing errors if users tried to set a boost on a cross_fields query.
@Mark_Harwood Thanks for the link to original discussion, I've read it throughout.
Still, I want to make sure I get it right:
cross_fields is perfectly valid to use on analyzed fields, whether long or short, such as "description" (according to @Clinton_Gormley1's comment)
it's valid to apply the boost option to counteract the effect of TF and the normalization factor (according to @jpountz's comment)
If the above is true,
User-supplied field-level boosts can be used as a tool to prioritize certain fields, while the IDF scores are evened out by cross_fields. E.g. we search for "Will Smith" and want to prioritize the "description" field over the "first name", "last name" and "biography" fields.
In practice, the following statements hold true for most search apps:
1. Users are lazy. We can't be bothered to put the right words into the right input boxes in a structured form so we reach for typing multiple keywords into a single search box. So no useful context is provided for each word. Equally, users are unlikely to specify "boosts" for fields and certainly not per-search-term field boosts.
2. Application developers are lazy. We won't parse the user's search input, understand each search term provided and for each determine which field is the most likely target and provide query-specific boosts for each term and field pair. We won't consider if the sequence of the words is important and if they represent a phrase. We'll just chuck the user's search text at the search engine and hope for the best (sometimes developers will use a global "template" query with boosts for what they consider to be generally important fields but this is not sensitive to the actual search words a user provides).
Cross fields is trying to do the analysis that application developers fail to do in point 2: understand the actual query terms being used on a per-search basis and then decide on the boosts required to tailor that specific query.
I'm not so sure. "Boost of 1" means the default boost i.e. no alteration to the input score. So the lazy user and lazy developer have both opted to run with defaults.
"Short string fields" means things like firstname, surname or address where a search for "john smith" would be really bad if you didn't have cross_fields tweaking the IDF for you to stop "address:smith street" being ranked top.
We're using cross_fields, so the "address" field won't be biased (the IDF scores are evened out). Now I just want to make "description" a priority field over the rest of the fields. Why can't I benefit from the boost option in this scenario?
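For concreteness, the scenario being asked about could be expressed roughly as the request body below, built here as a Python dict. The `field^boost` syntax and the `cross_fields` type are standard multi_match options; the field names, the boost value of 3 and the `cross_fields_query` helper are assumptions made up for this sketch.

```python
import json


def cross_fields_query(user_text, boosted_field, boost, other_fields):
    """Build a multi_match body of type cross_fields with one boosted field.

    Hypothetical helper: the field names and boost are illustrative only.
    """
    fields = [f"{boosted_field}^{boost}"] + list(other_fields)
    return {
        "query": {
            "multi_match": {
                "query": user_text,
                "type": "cross_fields",
                "fields": fields,
            }
        }
    }


body = cross_fields_query(
    "Will Smith",
    boosted_field="description",
    boost=3,
    other_fields=["first_name", "last_name", "biography"],
)
print(json.dumps(body, indent=2))
```

Note that the boost here is fixed per field: it applies equally to "Will" and to "Smith", whatever field each term most naturally belongs to.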
The questionable "benefit" of your boost is the ability to provide a "one-size-fits-all-search-terms" multiplier that is not tweaked for each term in the user's input.
It does not matter if the user's search, when tokenized, contains what cross_fields can clearly see as both an address and a name - your ham-fisted boost is going to try to steer both terms in the direction of a particular field.
It's like taking an airhorn to a classical music concert - you can do it but it's undoing some of the subtlety in the process.
@Mark_Harwood Thanks for detailed explanation. It confirms what I already understood.
As I see it, it's actually not undone, right? E.g. if a user boost is provided for "first name", cross_fields evens out the IDF scores for every query term and only then applies the boost to the "first name" field (for every query term). At the same time, no user boost is provided for "last name", so it evens out the IDF scores for the "last name" field and the default boost of 1 remains.
Also, I still don't get why we can't use cross_fields on big free-text fields.
My point is user-supplied boosts typically don’t consider if the actual search terms provided in a particular user query are surnames, forenames or a mix of both. They just apply a lazy one size-fits-all policy of “I think field X is great”. (Correct me if I’m wrong in my assumption about your app). Such a boosting policy undoes the minute scoring changes cross-fields does to understand each term in the query and favour what has been determined as the right field context for that word.
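A back-of-the-envelope sketch of the scale mismatch being described, under loud assumptions: BM25-style IDF, invented document counts, a hypothetical boost of 2 on the surname field, and TF and length normalization ignored entirely.

```python
import math


def bm25_idf(doc_freq, doc_count):
    """BM25-style IDF: the fewer docs contain a term, the higher the value."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


DOC_COUNT = 10_000  # invented index size

# Assumed DFs for "mark" after the cross_fields adjustment:
# forename keeps its DF of 500, surname is reported as 500 + 1.
idf_forename = bm25_idf(500, DOC_COUNT)
idf_surname = bm25_idf(501, DOC_COUNT)

# How strongly cross_fields prefers forename over surname for this term.
nudge = idf_forename / idf_surname

# A one-size-fits-all field boost (hypothetical value).
SURNAME_BOOST = 2.0
boosted_surname = SURNAME_BOOST * idf_surname

print(f"cross_fields nudge: x{nudge:.4f}")
print(f"forename:mark = {idf_forename:.4f}")
print(f"surname:mark with boost = {boosted_surname:.4f}")
```

In this toy setup the cross_fields preference is a fraction of a percent, so any field boost noticeably different from 1 decides the ordering on its own, for every term in the query.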
Like I say - you’re free to do what you want though so if you want to go ahead and supply your own boost, more power to you.
In our app, neither the end user nor the developer knows which query term belongs to which column, even when the developer inspects a query by hand. The end user's "friendly" query is searched over "internal/very technical" columns.
So as I see, the case you wrote is different from our app's case mentioned above: there is no relation between mimic "input boxes" and "most-likely column"; at the same time, we'd like to even out IDF-scores between columns.
Sorry, I don't know what you mean by "mimic input boxes". My assumption is your app is like most search apps - there's a single search input box into which people type one or more words.
If so, then it sounds like what you have been asking for is based on the following principles:
1. We have a favourite field we always want to boost regardless of what a user typed.
2. We make no attempt to favour which field each individual search term naturally belongs to.
3. We like to think some of the search words in a query are more important than others.
If you didn't want rule 3, then you could wrap each search term in a constant_score query to dispense with IDF completely, because your app knows best (see rule 1).
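One possible reading of that suggestion, sketched as a Python function that builds the request body. The `constant_score`, `dis_max`, `bool` and `match` queries are real query types, but the way they are combined here, the whitespace tokenization, the field names and the boost values are all assumptions for illustration.

```python
import json


def constant_score_query(user_text, field_boosts):
    """Score each term purely by which field it hits, ignoring IDF.

    Hypothetical helper: every (term, field) pair becomes a constant_score
    clause whose score is just the field's boost. dis_max keeps the best
    field per term, and bool/should adds the terms together.
    """
    per_term = []
    for term in user_text.split():  # naive whitespace tokenization
        per_field = [
            {
                "constant_score": {
                    "filter": {"match": {field: term}},
                    "boost": boost,
                }
            }
            for field, boost in field_boosts.items()
        ]
        per_term.append({"dis_max": {"queries": per_field}})
    return {"query": {"bool": {"should": per_term}}}


body = constant_score_query(
    "Will Smith",
    {"description": 3.0, "first_name": 1.0, "last_name": 1.0},
)
print(json.dumps(body, indent=2))
```

With this shape a rare word and a common word contribute the same score when they hit the same field, which is exactly what giving up rule 3 means.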
However, I assume you want rule 3 because users can type a mix of terms in a query, e.g. (rareWord, commonWord), and given the choice you prefer docs that match rareWord over docs that only match commonWord. To have this preference you need to know which of these words is rarer, using IDF. The key question is which field you use to obtain this IDF. The sensible answer is "the most likely one", which is what cross-fields does (other search types typically favour the least likely).
So, bringing this all together - you propose a system that prefers rareWord over commonWord based on an understanding of "correct" context (thanks to cross fields) but your rule 1 will just boost "hits in field X" because you think that's a generally good thing to do for all queries, regardless of the context of each search term.