Problem with Dis Max Query


#1

I have these documents:

{
  "firstfield" : "House"
}

{
  "firstfield" : "Red House",
  "secondfield" : "Blue House"
}

I want to query every field of a document. Only the field with the maximum score should set the final score for the document.

For example, when I search for 'House', the first document should have the highest score.

I tried to achieve this with a dis max query, but there always seems to be a boost when both fields contain the word.

My query looks like this:

"dis_max" : {
  "tie_breaker" : 0.0,
  "queries" : [ {
    "match" : {
      "firstfield" : {
        "query" : "House",
        "operator" : "AND"
      }
    }
  }, {
    "match" : {
      "secondfield" : {
        "query" : "House",
        "operator" : "AND"
      }
    }
  } ]
}

Does anyone have a solution for my problem? I am using Elasticsearch 2.2.


(Ivan Brusic) #2

Why should the first document score higher? Can you post the explanation of
the query?

If the second document scores higher, then perhaps the term 'house' has a
higher IDF value in the 'secondfield' field. Hard to tell without data.

Ivan


#3

I gave you all the data I am testing with (only 2 documents).

This query produces these scores:

"dis_max" : {
  "tie_breaker" : 0.0,
  "queries" : [ {
    "match" : {
      "firstfield" : {
        "query" : "House",
        "type" : "boolean",
        "operator" : "AND"
      }
    }
  }, {
    "match" : {
      "secondfield" : {
        "query" : "House",
        "type" : "boolean",
        "operator" : "AND"
      }
    }
  } ]
}

"Red House" , "Blue House" => score 0.19178301

"House" => score 0.09415865

If I change the query to use only 1 field (which doesn't make much sense), I get different results:

"dis_max" : {
  "tie_breaker" : 0.0,
  "queries" : [ {
    "match" : {
      "firstfield" : {
        "query" : "House",
        "type" : "boolean",
        "operator" : "AND"
      }
    }
  } ]
}

"House" => score 0.30685282

"Red House" , "Blue House" => score 0.19178301

If it were really a pure disjunction max query, it should return a score of 0.30685282 for "House" in both queries.

Either this is a bug, or the Dis Max Query is totally useless :(


(Ivan Brusic) #4

By data I mean the explanation of the query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-explain.html
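
As a rough sketch of what such a request could look like, the snippet below builds the dis_max query from this thread with `explain` set to true in the search body. The helper name `build_explain_request` is made up for illustration, and actually sending the body to the index's `_search` endpoint is left out.

```python
import json

def build_explain_request(term, fields):
    """Build a search body that asks Elasticsearch to explain each hit's score."""
    return {
        "explain": True,  # include the Lucene score explanation with every hit
        "query": {
            "dis_max": {
                "tie_breaker": 0.0,
                "queries": [
                    {"match": {field: {"query": term, "operator": "AND"}}}
                    for field in fields
                ],
            }
        },
    }

body = build_explain_request("House", ["firstfield", "secondfield"])
print(json.dumps(body, indent=2))
```

Each hit in the response should then carry an `_explanation` section with the idf, field norm and query norm values that went into its score.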

Is your environment sharded? If your documents are in different shards, the
same terms might have different IDF values. If you have just a few test
documents, scores will be more consistent in a single shard environment.

Also, in your last query, the first document probably has a higher score
because of length normalization. By default, a shorter field will be more
relevant than a longer field, even if the term frequency is the same:

https://www.elastic.co/guide/en/elasticsearch/guide/master/scoring-theory.html#field-norm

Reading Lucene explanations takes practice, but all the relevant details
are in there.
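
To make the arithmetic concrete, here is a small sketch that reproduces the scores posted above. It rests on assumptions that are not confirmed in this thread: the two documents sit on different shards, scoring uses the classic Lucene TF/IDF similarity, a one-term field has a norm of 1.0 and a two-term field a stored norm of 0.625, and with tie_breaker 0 the query norm is driven by the largest clause weight. The function names are made up for illustration.

```python
import math

def idf(doc_freq, num_docs):
    """Classic Lucene TF/IDF: idf = 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def dis_max_score(clause_idfs, matches):
    """Score one document for a dis_max of single-term clauses, tie_breaker 0.

    clause_idfs: per-shard idf of the term in each queried field.
    matches: {clause index: field norm} for the clauses that match the document.
    """
    # Assumption: with tie_breaker 0 only the largest clause weight enters the query norm.
    query_norm = 1.0 / math.sqrt(max(w * w for w in clause_idfs))
    tf = 1.0  # the term occurs once in every matching field
    return max(
        (clause_idfs[i] * query_norm) * (tf * clause_idfs[i] * norm)
        for i, norm in matches.items()
    )

# Assumed shard A: only {"firstfield": "House"}, a one-term field (norm 1.0).
a_first, a_second = idf(1, 1), idf(0, 1)  # no secondfield match, so that idf is 1.0
# Assumed shard B: only the "Red House" / "Blue House" document (norm 0.625).
b_first, b_second = idf(1, 1), idf(1, 1)

house_two_fields = dis_max_score([a_first, a_second], {0: 1.0})
house_one_field = dis_max_score([a_first], {0: 1.0})
red_blue_two_fields = dis_max_score([b_first, b_second], {0: 0.625, 1: 0.625})

print(round(house_two_fields, 4))     # 0.0942
print(round(house_one_field, 4))      # 0.3069
print(round(red_blue_two_fields, 4))  # 0.1918
```

Under these assumptions, "House" drops from 0.3069 to 0.0942 in the two-field query because the secondfield clause, which matches nothing on that shard, still gets an idf of 1.0 and dominates the shared query norm.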

Ivan


#5

Thank you for your help. I found out that IDF is the reason for these strange results. I think I was using only 1 shard the whole time. Is it possible that the same term gets different IDF values even if only 1 shard is used?


(Ivan Brusic) #6

The IDF should be the same for each document in the shard. The IDF
scores/weighs the search term, not the document. It sounds like documents
are in different shards. You can view the number of shards in the GET
settings API:

https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-get-settings.html

Elasticsearch has a default of five shards. If you read the explanation, it
will tell you which shard the document hit came from.

You can also change the search type to dfs_query_then_fetch, which first
collects term statistics from all shards so that IDF is computed globally.
There is a performance hit since it needs to do another network roundtrip
between the coordinating node and the data nodes, but I think it is slight.

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-type.html#dfs-query-then-fetch
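
A minimal sketch of how that search type can be selected through the `search_type` URL parameter of the search endpoint. The host `http://localhost:9200`, the index name `myindex` and the helper `search_url` are placeholders, not values from this thread.

```python
from urllib.parse import urlencode

def search_url(base_url, index, search_type=None):
    """Build the _search URL, optionally selecting a search type."""
    url = f"{base_url}/{index}/_search"
    if search_type:
        url += "?" + urlencode({"search_type": search_type})
    return url

# Placeholder host and index name, for illustration only.
url = search_url("http://localhost:9200", "myindex", "dfs_query_then_fetch")
print(url)  # http://localhost:9200/myindex/_search?search_type=dfs_query_then_fetch
```

The query body itself stays unchanged; only the URL parameter differs from a default search.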

The issue is minimized as you add more documents to an index. You normally
see the issue in test indices with a handful of documents.
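
For a small test index like the one in this thread, another option is to create it with a single primary shard so that every document shares the same term statistics. The sketch below only builds the body for the create index request; the helper name is made up, and disabling replicas is an extra assumption that suits a single-node test setup.

```python
import json

def single_shard_index_body():
    """Settings body for creating a test index with one primary shard."""
    return {
        "settings": {
            "number_of_shards": 1,    # all documents share one set of term statistics
            "number_of_replicas": 0,  # assumption: single-node test cluster
        }
    }

settings_body = single_shard_index_body()
print(json.dumps(settings_body, indent=2))
```

The number of primary shards is fixed when the index is created, so an existing test index would need to be recreated and the documents indexed again.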

Ivan

