Scoring is lower after adding new field to a document


(philipDS) #1

Hey,

We're building an application that lets people save a person's Twitter handle
together with an associated tweet. The user selects the handle, and the
application then fetches the tweet using the Twitter API. This information is
then stored in an ElasticSearch document. The twitter field is initialized
to {} (an empty JSON object; we're using a JavaScript ES wrapper). The
following is (part of) the mapping:

user     : { type: 'String', index: 'not_analyzed' },
location : { type: 'String', analyzer: 'lowercase_keyword' },
tags     : { type: 'String', index: 'not_analyzed' },
twitter  : {
  properties : {
    handle : { type: 'String', index: 'not_analyzed' },
    tweet  : { type: 'String', index: 'not_analyzed' },
    age    : { type: 'Date' }
  },
  type : 'nested'
}

The location and tags fields are already filled in by the time we find the
user's handle and tweet. When we find a handle and a tweet, we set all the
twitter fields (including the age, which is set to the current date), so the
field becomes something like:
{
  handle: 'twitter_handle',
  tweet: 'my last tweet',
  age: '....'
}

Now, after this twitter info is added, the score of the document changes,
even though the query does not search these twitter fields. The following
query is used to search for the documents we need:

query: {
  bool: {
    // "must" takes an array of clauses; all of them have to match
    must: [
      {
        text: {
          'tags': {
            query: 'Technology',
            operator: 'and'
          }
        }
      },
      {
        text: {
          'location': {
            query: 'Madrid',
            operator: 'and'
          }
        }
      },
      {
        text: {
          'user': {
            query: queryArgs.email,
            operator: 'and'
          }
        }
      }
    ]
  }
}

I have enabled explain and see that the score is lower after adding the
twitter information (to the same document as the tags, location and user),
but I don't understand why. Does this have to do with the term frequency
factor tf? Or what's the exact explanation for this behaviour? Should I keep
this twitter-specific information in another type in the index, linked by
ID, so that it doesn't affect the score? How can I avoid the change in score
in general while still adding fields to the document?
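
One way to narrow this down is to compare the explain output from before and after the update, factor by factor. The following is a minimal sketch, assuming the search body accepts explain: true and that each hit carries an _explanation object with value, description and details. The helper names flattenExplanation and diffExplanations are made up for illustration, and the two sample explanations are toy values, not output from the index discussed here.

```javascript
// Search body with explanations switched on (query abbreviated to one clause).
const searchBody = {
  explain: true,
  query: { text: { tags: { query: 'Technology', operator: 'and' } } }
};

// Walk an explanation tree and record one "path => value" entry per node.
// The child index is part of the path so that siblings with the same
// description do not overwrite each other.
function flattenExplanation(node, prefix = '', out = {}) {
  const key = prefix + node.description;
  out[key] = node.value;
  (node.details || []).forEach(function (child, i) {
    flattenExplanation(child, key + ' > [' + i + '] ', out);
  });
  return out;
}

// List every factor whose value differs between two explanations.
function diffExplanations(before, after) {
  const a = flattenExplanation(before);
  const b = flattenExplanation(after);
  return Object.keys(a)
    .filter(function (key) { return a[key] !== b[key]; })
    .map(function (key) {
      return { factor: key, before: a[key], after: b[key] };
    });
}

// Hypothetical explanations (invented numbers, for illustration only).
const before = {
  value: 2, description: 'weight(tags:Technology)', details: [
    { value: 2, description: 'idf' },
    { value: 1, description: 'fieldNorm' }
  ]
};
const after = {
  value: 1.5, description: 'weight(tags:Technology)', details: [
    { value: 1.5, description: 'idf' },
    { value: 1, description: 'fieldNorm' }
  ]
};

console.log(diffExplanations(before, after));
```

Whichever factor shows up in the diff (idf, fieldNorm, queryNorm, ...) points at the part of the scoring formula that moved.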

Thanks!


(philipDS) #2

Anyone that could help me?

On Monday, 2 July 2012 14:56:09 UTC+2, philipDS wrote:



(Jörg Prante) #3

Have you disabled the _all field? It seems you are not interested in it, so
you can just disable it. With the mapping shown, at least the "age" field
will also get indexed into the _all field and this might change the score.

Best regards,

Jörg

On Wednesday, July 4, 2012 12:01:05 PM UTC+2, philipDS wrote:



(philipDS) #4

I just removed all my indexes and disabled the _all field:

this.liMapping = {
  contacts: {
    "_all" : { "enabled" : false },
    properties: {
      // mapping here
    }
  }
};

Still the same result though :( My query also doesn't search on the _all
field, but on specific fields, so I'm not sure the _all field has anything
to do with this issue.

On Wednesday, 4 July 2012 12:19:57 UTC+2, Jörg Prante wrote:



(Clinton Gormley) #5

    I have set the explanation to true and see that the score is
    lower after adding the twitter information (to the same
    document as the tags, location and user), but I don't
    understand why. Does this have to do with the term frequency
    factor tf? Or what's the exact explanation for this behaviour?
    Should I keep this twitter specific information in another
    type in the index, with a link (by ID) to it to not affect the
    score?

Term frequency, field length, document length - it's a complex business.

Have a look at this page for a short summary:
http://www.lucenetutorial.com/advanced-topics/scoring.html

And at this page for a longer version:
http://lucene.apache.org/core/3_6_0/scoring.html
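
As a rough illustration of the factors those pages describe, here is a sketch of the classic Lucene 3.x DefaultSimilarity formulas re-expressed in JavaScript. The function names are ours (this is not an ElasticSearch or Lucene API), queryNorm, coord and boosts are left out, and the counts in the last two lines are invented. It only shows that idf depends on index-wide counts, which can move when a document is re-indexed (for example, each nested object is stored as its own Lucene document). Whether that is what happens in this case would have to be checked in the explain output.

```javascript
// Term frequency: grows with the number of occurrences in the field.
function tf(freq) {
  return Math.sqrt(freq);
}

// Inverse document frequency: depends on index-wide counts,
// not on the matched document alone.
function idf(docFreq, numDocs) {
  return Math.log(numDocs / (docFreq + 1)) + 1;
}

// Length norm: fields with fewer terms score higher
// (before boosts and the lossy one-byte norm encoding).
function lengthNorm(numTerms) {
  return 1 / Math.sqrt(numTerms);
}

// Contribution of one query term, leaving out queryNorm, coord and boosts.
function termScore(freq, docFreq, numDocs, numTerms) {
  const i = idf(docFreq, numDocs);
  return tf(freq) * i * i * lengthNorm(numTerms);
}

// Toy counts, invented for illustration: identical field content, but the
// index-wide counts differ, so the term's contribution differs too.
console.log(termScore(1, 1, 10, 1));
console.log(termScore(1, 2, 12, 1));
```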

    How can I avoid the change in score in general while still adding
    fields to the document?

My question is: why does this matter? Scoring is relative. There are
some situations where these differences may impact your search, but
generally they'll even out.

You may want to look at the omit_norms and omit_term_freq_and_positions
options on this page:
http://www.elasticsearch.org/guide/reference/mapping/core-types.html
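
A sketch of what that could look like for the three queried fields, assuming the contacts type name used earlier in the thread and the lowercase type names from the core-types reference. The variable name mapping is hypothetical; adapt it to whatever the wrapper expects.

```javascript
// Mapping sketch: switch off norms and term frequencies/positions on the
// fields that the query actually searches.
const mapping = {
  contacts: {
    properties: {
      user: {
        type: 'string',
        index: 'not_analyzed',
        omit_norms: true,
        omit_term_freq_and_positions: true
      },
      location: {
        type: 'string',
        analyzer: 'lowercase_keyword',
        omit_norms: true,
        omit_term_freq_and_positions: true
      },
      tags: {
        type: 'string',
        index: 'not_analyzed',
        omit_norms: true,
        omit_term_freq_and_positions: true
      }
    }
  }
};

console.log(JSON.stringify(mapping, null, 2));
```

Neither option removes the idf factor, so scores can still shift when index-wide term statistics change.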

clint


(philipDS) #6

Thanks clint. The omit_norms and omit_term_freq_and_positions options didn't
help (which is weird, because I really thought that was the problem).

I will read through the Lucene scoring documentation later.

More suggestions always welcome! Thanks.

2012/7/4 Clinton Gormley clint@traveljury.com


