Help understanding the explain output and scoring


(Davi Alexandre) #1

Hi,

I have a movies index with thousands of documents.

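Two of these documents are:

https://gist.github.com/davialexandre/7881281#file-16855-json
https://gist.github.com/davialexandre/7881281#file-2739-json

And I’m using this mapping definition, with some boosting for the most important fields:

https://gist.github.com/davialexandre/7881281#file-mapping-json

The problem is that when I run this query, I get a result very different from what I expected: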

{
  "query": {
    "query_string": {
      "query": "rio"
    }
  }
}

As you can see, both documents have 4 matches for the query (1 in titulo_portugues, 1 in titulo_original, 1 in sinopse and 1 in elenco), but in 16855 the query matches the values of titulo_portugues and titulo_original exactly. Because of this, I was expecting document 16855 to have the highest score (or at least one of the highest), but it is actually getting a very low score, much lower than document 2739, which has the highest score for this query. The query returns a total of 456 hits, and 16855 is one of the last documents!

Trying to understand what is happening, I used explain and got this:

https://gist.github.com/davialexandre/7881281#file-explain_16855-json
https://gist.github.com/davialexandre/7881281#file-explain_2739-json

I see that both documents score exactly the same for tf and idf, but 2739 gets a higher value for fieldNorm, and I don’t understand what that is or how it is calculated. Could someone help me understand this?

Thanks!

--
Davi Alexandre

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/etPan.52a63d0d.2ae8944a.2c8%40Macbook.
For more options, visit https://groups.google.com/groups/opt_out.


(Luca Cavanna) #2

Hi,
The field norm combines the index-time boost (if you used one) with a
normalization factor based on the length of the field. Shorter fields score
higher than longer ones, since a match in a short field tends to summarize
the document better and so is considered more relevant.
If you don't want this and you don't need index-time boosting either, you
can disable norms per field in your mapping (omit_norms=true). Just
remember that norms are stored in the Lucene index, so you need to
reindex your documents for the change to take effect.
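
As a rough illustration of the length component, here is a minimal sketch assuming Lucene's classic TF-IDF similarity, where the norm is the index-time boost divided by the square root of the number of terms in the field. The function name `field_norm` is made up for this example, and the sketch ignores the lossy single-byte encoding Lucene applies when it stores norms, so the values shown by explain will be coarser than these.

```python
import math

def field_norm(num_terms, boost=1.0):
    """Approximate fieldNorm: index-time boost / sqrt(number of terms)."""
    if num_terms <= 0:
        raise ValueError("num_terms must be positive")
    return boost / math.sqrt(num_terms)

# Compare a one-term field with progressively longer fields.
for n in (1, 4, 100):
    print(n, field_norm(n))
```

The longer the field, the smaller the norm, so the same tf and idf values produce a lower score on a longer field.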



(Davi Alexandre) #3

I suspected it was related to field length, but why does 16855 have a lower score than 2739 when it has shorter fields that match the query term exactly?

--
Davi Alexandre



(benjamin leviant) #4

Hi Davi,

Your query does not name a field, so it runs against the _all field. This
is a special field that is indexed by default and holds the combined
contents of all the fields of each document.

Because document 16855 has many more values in its fields than document
2739 (elenco, curiosidades, ...), its _all field is longer and it gets a
lower score for queries on "_all".
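
To make the effect concrete, here is a small sketch that assumes Lucene's classic length normalization (roughly 1 / sqrt(number of terms)). The two documents are invented for illustration and are not the real 16855 and 2739. `all_field_terms` is a hypothetical helper that imitates _all by joining every field value and splitting on whitespace, which is much cruder than a real analyzer.

```python
import math

def all_field_terms(doc):
    """Imitate _all: join every field value and split on whitespace."""
    return " ".join(str(v) for v in doc.values()).split()

def length_norm(terms):
    """Length component of the norm: 1 / sqrt(number of terms)."""
    return 1.0 / math.sqrt(len(terms))

# Invented documents: both titles are exactly "Rio".
short_doc = {"titulo_portugues": "Rio", "sinopse": "uma arara azul"}
long_doc = {
    "titulo_portugues": "Rio",
    "sinopse": "uma arara azul viaja ao rio de janeiro",
    "elenco": "ator um ator dois ator tres ator quatro",
}

for name, doc in (("short_doc", short_doc), ("long_doc", long_doc)):
    title_norm = length_norm(doc["titulo_portugues"].split())
    all_norm = length_norm(all_field_terms(doc))
    print(name, "title norm:", title_norm, "_all norm:", round(all_norm, 3))
```

Both titles get the same norm on their own field, but on the combined field the document with more content is penalized.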

You should specify which fields you want to query, for example:

{
  "query": {
    "query_string": {
      "fields": ["titulo_portugues", "titulo_original"],
      "query": "rio"
    }
  }
}
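
If you build the request in code, a query like the one above can be assembled as a plain dictionary and serialized to JSON. This is only a sketch: `build_query` is a hypothetical helper, and nothing here talks to Elasticsearch; it just prints the request body.

```python
import json

def build_query(text, fields):
    """Build a query_string body restricted to the given fields."""
    return {"query": {"query_string": {"fields": list(fields), "query": text}}}

body = build_query("rio", ["titulo_portugues", "titulo_original"])
print(json.dumps(body, indent=2))
```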

I hope it can help you.

Regards

Benjamin



(Davi Alexandre) #5

Thanks, Benjamin! This gives me the expected score.

I thought _all was just some kind of wildcard for all fields, not a field holding the contents of all fields. My mistake :(

Thanks again for the help!

Davi Alexandre


