Handling tag "weights"

Some of my documents have the following: "tag: [dog, animal, pets]".
Is it possible to:

  1. influence index scoring based on a score for each tag, with something
    along the lines of "tag: {dog: 0.5, animal: 0.3, pets: 0.4}"

  2. influence search scoring for each word, so on "hello dog" I would
    decide that "hello" is 0.3 and "dog" is 0.8 (I believe this is
    hello^0.3, dog^0.8 in Lucene)

Thanks!

The only index-time scoring option is really just the boost option (on a
field or a document). There is a way to store extra information per indexed
term, and then have a custom query that takes it into account when scoring
(that's actually what the _all support does internally to support having a
custom boost value per field).

To support something similar to boosting, but at the term level (as in
this case), that mechanism needs to be exposed. It's not terribly difficult
to implement; the interesting part is how best to expose it as an API.

Regarding custom search-time scoring, which is usually enough and does not
require index-time scoring: you can either set a boost on each query you
create (each query element accepts a boost, like the term query or field
query), which is actually the same as dog^0.3, or use the custom_score
query for complete, script-level control over scoring.

-shay.banon
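
A minimal sketch of the two search-time options from the reply above, written
as Python dicts that would be serialized to JSON and sent as a search request
body. The "tag" field and the weights come from the question; the helper
names are hypothetical. The shape of the custom_score body (query, script and
params keys) and the "_score * factor" script are assumptions based on the
query DSL of this era, so check them against the documentation for your
version before relying on them.

```python
import json


def boosted_terms_query(field, weights):
    """Build a bool query with one boosted term clause per word."""
    should = [
        {"term": {field: {"value": word, "boost": boost}}}
        for word, boost in weights.items()
    ]
    return {"bool": {"should": should}}


def lucene_query_string(weights):
    """Render the same weights in Lucene query syntax (word^boost)."""
    return " ".join(f"{word}^{boost}" for word, boost in weights.items())


def custom_score_query(inner_query, script, params=None):
    """Wrap a query in a custom_score query (assumed body layout)."""
    body = {"query": inner_query, "script": script}
    if params:
        body["params"] = params
    return {"custom_score": body}


if __name__ == "__main__":
    weights = {"hello": 0.3, "dog": 0.8}
    print(lucene_query_string(weights))
    bool_query = boosted_terms_query("tag", weights)
    print(json.dumps({"query": bool_query}, indent=2))
    # Hypothetical script: multiply the normal score by a constant parameter.
    scripted = custom_score_query(bool_query, "_score * factor", {"factor": 2.0})
    print(json.dumps({"query": scripted}, indent=2))
```

The boosted bool query needs no scripting, so it is probably the one to try
first; the scripted form is only worth it when the score depends on something
a plain boost cannot express.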

On Tue, Aug 31, 2010 at 7:47 PM, brandonlee mluggy@gmail.com wrote:

Some of my documents have the following: "tag: [dog, animal, pets]".
Is it possible to:

  1. influence index scoring based on a score for each tag, with something
    along the lines of "tag: {dog: 0.5, animal: 0.3, pets: 0.4}"

  2. influence search scoring for each word, so on "hello dog" I would
    decide that "hello" is 0.3 and "dog" is 0.8 (I believe this is
    hello^0.3, dog^0.8 in Lucene)

Thanks!

While we're waiting for it to be exposed on the index API, do you
think one of the following would work for "tag: {dog: 0.5, animal:
0.3, pets: 0.4}"? Which is better?

  1. Duplicating the number of occurrences for each tag, so we'll have 5
    times "dog", 3 times "animal" and 4 times "pets"
  2. Defining 3 fields with "dog" on a 0.5 boosted "tag05" field,
    "animal" on a 0.3 boosted "tag03", etc.
  3. Defining 3 fields (tag1-tag3) but setting different boost levels
    based on each document's list of tags

I keep reading your comments that the custom_score query is slow, but how
slow? Are there any limitations on the number of variables?

Thanks Shay!
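
Option 1 above can be sketched as a small preprocessing step run before
indexing. This is only an illustration: expand_tags and its scale parameter
are hypothetical names. Note that with Lucene's default similarity the
term-frequency factor grows with the square root of the count, and field
length normalization also applies, so repeating "dog" 5 times will not weight
it in a simple 5:3 ratio against 3 copies of "animal".

```python
def expand_tags(weights, scale=10):
    """Repeat each tag round(weight * scale) times to mimic a per-tag weight."""
    tags = []
    for tag, weight in weights.items():
        tags.extend([tag] * round(weight * scale))
    return tags


if __name__ == "__main__":
    # The weights are the ones from the question.
    doc = {"tag": expand_tags({"dog": 0.5, "animal": 0.3, "pets": 0.4})}
    print(doc)
```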

I think that you should first try to solve it on the search side. If you
know the weights when you do a search, you can either construct a query that
applies different boosts depending on the tag, or use the custom_score query.

The custom_score query is slower than other queries, but I suggest you run
it and check whether it's OK for you (with actual data and a relevant index
size). The good thing is that if it's slow for you (and slow here means both
latency and QPS under load), you can always add more replicas and more
machines to spread the load.

-shay.banon
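
One way to "run and check" is a small timing harness around whatever client
call you already use. The sketch below assumes a hypothetical search callable
that takes a query body and blocks until the response arrives. It measures
single-client latency only; QPS under load would need several concurrent
clients against a realistically sized index.

```python
import statistics
import time


def measure_latency(search, query, runs=50):
    """Call search(query) `runs` times and summarize latency in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        search(query)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Approximate 95th percentile by index into the sorted samples.
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


if __name__ == "__main__":
    def fake_search(query):
        # Stand-in for a real client call; sleeps to simulate a round trip.
        time.sleep(0.001)

    print(measure_latency(fake_search, {"match_all": {}}, runs=20))
```

Running it once with the boosted query and once with the custom_score query
over the same data gives a rough side-by-side comparison.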
