I am having problems understanding the docFreq for multi_match queries when using the type cross_fields.
I created a single-shard test index and indexed 4 documents into it:
{ "random0": "banana", "random1": "banana", "random4": "banana"}
{ "random0": "banana", "random2": "banana"}
{ "random0": "banana", "random3": "banana"}
{ "random0": "banana", "random4": "banana"}
So every document has the field random0, and documents one and four also have the field random4.
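For anyone who wants to reproduce this, the setup can be sketched roughly as the requests below. The index name test, the type name test and the document ID 1 are taken from the explain request; the IDs 2 to 4 and the refresh=true parameter are assumptions added for illustration. These are the typed endpoints as used on 5.6; on 7.x they are deprecated and the typeless forms (for example PUT test/_doc/1) would be used instead.

```
PUT test
{
  "settings": { "number_of_shards": 1 }
}

PUT test/test/1?refresh=true
{ "random0": "banana", "random1": "banana", "random4": "banana" }

PUT test/test/2?refresh=true
{ "random0": "banana", "random2": "banana" }

PUT test/test/3?refresh=true
{ "random0": "banana", "random3": "banana" }

PUT test/test/4?refresh=true
{ "random0": "banana", "random4": "banana" }
```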
I then ran the following explain query (all 4 docs match this query):
GET test/test/1/_explain
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "banana ",
            "fields": [
              "random0",
              "random4"
            ],
            "type": "cross_fields"
          }
        }
      ]
    }
  }
}
I would expect the docFreq to be 4, but in the explain output both the random0:banana and the random4:banana sections report a docFreq of 2.
I understood that the idea behind cross_fields was that the fields are treated like one big field, so why is the docFreq not 4?
Then, if I add the field random4 with a value other than banana ("random4": "apple") to doc2, the docFreq jumps to 3. If I add this field to doc3 as well, the docFreq jumps to 4. So the docFreq seems to be the number of docs that contain this field, not the number of matching docs, and certainly not the max over all the fields in the multi_match.
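The first of those changes can be sketched as re-indexing the second document with the extra field and then repeating the explain request. The document ID 2 and the refresh=true parameter are assumptions for illustration; the type name test is taken from the explain request.

```
PUT test/test/2?refresh=true
{ "random0": "banana", "random2": "banana", "random4": "apple" }
```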
But in a bigger index (one shard, 7M docs) we are seeing even stranger results: the docFreq is much bigger than the number of docs containing the least-populated field, but much smaller than the max number of matching docs. So how is docFreq calculated?
I have tested this on an old version (5.6) and also on the current version (7.10). Both show the same behavior.