Question about _boost result ordering


(Kellan) #1

I have documents that look something like:

{ authorId: "9c76e24a8586f3389b2e9758", _boost: 2.54631, keywords:
[suspense, mystery], <other values ...>}

I have noticed that when I search by authorId, the results are roughly
ordered by the boost value, but something else is contributing to the
final _score for sorting. The documents all have only 1 author, so the
match is exact and there isn't anything else in the author field to
skew the result ordering. In one case, it seems that documents with
fewer keywords are getting a small boost. Any ideas on why this might
be happening? The mapping for keywords is:

        keywords: {
            type: "string",
            store: "no",
            index: "analyzed",
            analyzer: "snowball"
        },

while all other fields are defaulted. My query is:

query: {
    term: {
        authorId: "9c76e24a8586f3389b2e9758"
    }
}

Kellan


(Clinton Gormley) #2

Hi Kellan

You should probably set your authorId to {index: "not_analyzed"}. It is
a fixed value, so you don't want it to be analyzed at all.

The score is calculated from a number of values, including:

  • the boost that you specified
  • how frequently your term appears in all your docs (eg
    'smith' appears very frequently, and so is less important
    than 'gormley')
  • how frequently the term appears in the field
  • what percentage of the field consists of your term
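
The factors listed above can be sketched as a simplified version of Lucene's classic TF-IDF formula. This is an illustration only: the function names are made up for this example, and it leaves out query normalization, the coordination factor, and the lossy storage of norms.

```python
import math

def tf(freq):
    # Term frequency: more occurrences in the field score higher, with damping.
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    # Inverse document frequency: rare terms ('gormley') beat common ones ('smith').
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def length_norm(num_terms):
    # Shorter fields score higher: the term makes up more of the field.
    return 1.0 / math.sqrt(num_terms)

def term_score(freq, doc_freq, num_docs, num_terms, boost=1.0):
    # Hypothetical helper combining the four factors for a single-term query.
    return tf(freq) * idf(doc_freq, num_docs) ** 2 * boost * length_norm(num_terms)
```

When freq, doc_freq, num_docs and num_terms are identical across documents, term_score varies only with boost, which is the ordering one would expect for an exact single-token match.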

Two options here:

You could use a filter to search for authorId (all authorId values
would have _score = 1).
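
A minimal sketch of the filter option, written as a Python dict that serializes to the request body. It assumes a constant_score query wrapping a term filter, which is one way to express this in the query DSL; check the syntax against your Elasticsearch version.

```python
import json

author_id = "9c76e24a8586f3389b2e9758"

# Assumption: constant_score with a term filter, so every hit gets the same _score.
filter_query = {
    "query": {
        "constant_score": {
            "filter": {"term": {"authorId": author_id}}
        }
    }
}

print(json.dumps(filter_query, indent=2))
```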

But you're specifying a custom boost per doc, so presumably you want
some authors to be more important than others.

In this case, you should set the authorId field to {omit_norms: true}.
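
As a rough sketch, the mapping for that field could look like the following, again as a Python dict that serializes to JSON. The type name "book" is hypothetical, and the field settings simply combine the two suggestions made in this reply.

```python
import json

# Assumption: "book" is a hypothetical mapping type name; use your own.
mapping = {
    "book": {
        "properties": {
            "authorId": {
                "type": "string",
                "index": "not_analyzed",  # keep the id as a single exact term
                "omit_norms": True,       # do not store norms for this field
            }
        }
    }
}

print(json.dumps(mapping, indent=2))
```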

You can read more about norms here:
http://www.lucidimagination.com/content/scaling-lucene-and-solr#d0e71
http://www.lucidimagination.com/blog/2009/03/18/exploring-lucenes-indexing-code-part-2/

clint


(Kellan) #3

Clint,

Thanks for the suggestion of using "not_analyzed".

I tried the "omit_norms" suggestion, but this led to even more
confusing behavior: the 10 search results all had a score of either
8.836764 or 8.300338, which seemed to have nothing to do with the
_boost value.

I'm not trying to boost some authors more than others. Rather, I'm
trying to boost some documents more than others (even by the same
author). I guess if I search for a single author, it seems like the
results should be sorted purely by the boost value as there is nothing
else to make the search prefer one document over another.

One thing is very peculiar ... often documents with different boost
values have exactly the same _score (at least to 5 decimal places).
This seems to happen much more often than coincidence would suggest.
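
One possible explanation, which nothing in this thread confirms: Lucene's classic similarity folds an index-time boost into the field norm and stores that norm in a single byte, keeping roughly one significant decimal digit. The sketch below is a Python port of Lucene's SmallFloat.floatToByte315 encoding (the Python function names are made up for this example) and shows nearby boost values collapsing to the same stored value.

```python
import struct

def float_to_byte315(f):
    # Reinterpret the 32-bit float as a signed int, then keep only the top bits.
    bits = struct.unpack(">i", struct.pack(">f", f))[0]
    small = bits >> 21
    zero_point = (63 - 15) << 3
    if small <= zero_point:
        return 0 if bits <= 0 else 1   # underflow: zero or smallest positive value
    if small >= zero_point + 0x100:
        return 255                     # overflow: largest representable value
    return small - zero_point

def byte315_to_float(b):
    # Inverse mapping: expand the byte back into a 32-bit float.
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << 21) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]

def stored_norm(value):
    # Hypothetical helper: the value that survives an encode/decode round trip.
    return byte315_to_float(float_to_byte315(value))

for boost in (2.54631, 2.9, 3.2):
    print(boost, "->", stored_norm(boost))
```

Under this encoding, every value from 2.5 up to (but not including) 3.0 is stored as 2.5, so documents with different boosts in that range would score identically. If this is the mechanism, it would also fit the omit_norms result above, since dropping norms drops the stored boost along with them.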

Kellan

