docFreq for multi_match with type cross_fields

Lea_Lacoste · December 9, 2020, 12:02pm

I am having problems understanding the docFreq for multi_match queries when using the type cross_fields.

I created a one shard test index, onto which I pushed 4 documents:

{ "random0": "banana", "random1": "banana", "random4": "banana"}
{ "random0": "banana", "random2": "banana"}
{ "random0": "banana", "random3": "banana"}
{ "random0": "banana", "random4": "banana"}

So every document has the field random0, and documents one and four also have the field random4.
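To make the per-field numbers concrete, here is a small Python sketch that counts them for the four documents above. It is not Elasticsearch code: `field_stats` is a made-up helper, and whitespace splitting only stands in for the real analyzer. It returns how many documents contain the term in a field (the plain per-field docFreq) and how many documents have the field at all.

```python
# The four test documents from the post.
docs = [
    {"random0": "banana", "random1": "banana", "random4": "banana"},
    {"random0": "banana", "random2": "banana"},
    {"random0": "banana", "random3": "banana"},
    {"random0": "banana", "random4": "banana"},
]


def field_stats(docs, field, term):
    """Return (doc_freq, doc_count) for one field.

    doc_freq  = number of docs whose `field` contains `term`
    doc_count = number of docs that have `field` at all
    Hypothetical helper: whitespace splitting stands in for analysis.
    """
    doc_count = sum(1 for d in docs if field in d)
    doc_freq = sum(1 for d in docs if term in d.get(field, "").split())
    return doc_freq, doc_count


for field in ("random0", "random4"):
    print(field, field_stats(docs, field, "banana"))
```

Taken field by field, random0:banana appears in 4 documents and random4:banana in 2, and only 2 documents have the field random4 at all.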

I then ran the following explain query (all 4 docs match this query):

GET test/test/1/_explain
{
   "query":{
      "bool":{
         "should":[
            {
               "multi_match":{
                  "query":"banana ",
                  "fields":[
                     "random0",
                     "random4"
                  ],
                  "type":"cross_fields"
               }
            }
         ]
      }
   }
}

I would expect the docFreq to be 4, but in the results both the random0:banana and the random4 sections report a docFreq of 2.

I understood that the idea behind cross_fields was that the fields are treated like one big field, so why is the docFreq not 4?

Then, if I add the field random4 to doc2 with some value other than banana ("random4": "apple"), the docFreq jumps to 3. If I add this field to doc3 as well, the docFreq jumps to 4. So the docFreq seems to be the number of docs that have this field, not the number of matching docs, and certainly not the max over all the fields in the multi_match.

But in a bigger index (one shard, 7M docs) we are seeing even stranger results: the docFreq is much bigger than the smallest per-field count of docs that have the field, but much smaller than the largest count of matching docs. So how is docFreq calculated?

I have tested this on an old version (5.6) and on the current version (7.10). Both show the same behavior.

Mark_Harwood · December 9, 2020, 12:48pm

The original motivation was that it would favour the "right" field - the most likely context for each term.

Lucene's natural tendency is to promote matches on rare things so when searching for "John Sablosky" in first_name and last_name fields it would naturally favour the most bizarre context for each term e.g. the last_name: John.
cross_fields makes up for this by taking the most likely interpretation for each term (e.g. John is a first name) and using that doc frequency for all John-related matches. There is one caveat: the "correct" field (first_name:John) is boosted to beat any last_name:John by making it artificially more interesting to Lucene (subtracting 1 from first_name:john's doc frequency). A similar thing happens to the other term, "Sablosky", favouring the last_name field. Overall, these tweaks keep the relative importance of the words: Sablosky is rarer than John and is seen as the better match if only one of the 2 search terms is found.
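The adjustment described above can be sketched in a few lines of Python. This is only an illustration of the stated motivation, not the actual Lucene or Elasticsearch implementation (which, as noted, has changed over time and may apply further limits). The function name `blend_doc_freqs` and the example frequencies are made up.

```python
def blend_doc_freqs(doc_freqs):
    """Blend one term's per-field doc frequencies, as described in the post.

    doc_freqs: dict mapping field name -> doc frequency of the term there.
    Every field gets the highest doc frequency seen for the term, except
    the field that actually has that highest frequency, which gets one
    less so that it looks slightly rarer (and scores slightly higher).
    Hypothetical sketch: not the real implementation.
    """
    max_df = max(doc_freqs.values())
    # The most likely context for the term; ties go to the first field.
    best_field = max(doc_freqs, key=doc_freqs.get)
    blended = {}
    for field in doc_freqs:
        if field == best_field:
            blended[field] = max(1, max_df - 1)
        else:
            blended[field] = max_df
    return blended


# Made-up frequencies: "john" is common as a first name, rare as a last name.
print(blend_doc_freqs({"first_name": 1000, "last_name": 5}))
# Made-up frequencies: "sablosky" is rare overall and mostly a last name.
print(blend_doc_freqs({"first_name": 1, "last_name": 20}))
```

With these made-up numbers, both fields end up with nearly the same doc frequency for "john" (around 1000) and for "sablosky" (around 20), so the rarer word still carries more weight, while the expected field wins by a small margin for each term.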

I know some things have been changing in the implementation but that was the original motivation when I wrote it.
