Scoring and "Relative-ness" based on Business Rules

daviddeath · January 7, 2014, 5:36pm

What is the best way to make products more relevant outside of the default
scoring?

I have an unknown number of business rules that will dictate a document's
"relativity" (its relevance). That is, even if one document scores higher
than another, the lower-scoring document may be more relevant to the user.

Given two products with similar titles but different attributes and the
query "ipad", I'd like to promote one over the other:

{
  "title_simple": "iPad Mini Case",
  "description_simple": "Royce Leather iPad Mini Case:...",
  "category": "Computers & Accessories",
  "brand": "Royce Leather",
  "id": 794809052574
}

{
  "title_simple": "Apple iPad mini (16GB, Wi-Fi + Sprint 4G, White)",
  "description_simple": "iPad mini features a beautiful 7.9\" display...",
  "category": "Electronics",
  "brand": "Apple",
  "id": 885909689712
}

A simple query scores the iPad case high:

{
  "query": { "term": { "title_simple": "ipad" } }
}

But business rules dictate that the actual iPad be on the top.

I can run a filter or score based on the attribute or brand to get what I'm
looking for:

{
  "query": {
    "function_score": {
      "query": { "term": { "title_simple": "ipad" } },
      "functions": [{
        "filter": { "term": { "category_simple": "electronics" } },
        "boost_factor": 2
      }]
    }
  }
}

But building a bunch of these isn't scalable or reasonable.

I have an unknown number of these and that number will continue to grow.
Some other examples:

query "xbox" should promote consoles over games
query "macbook" should promote Apple computers over macbook sleeves
query "Apple" should promote Apple products and not food

Building a thousand queries out of function filters is unreasonable and
unscalable.

Some possible solutions I've considered:

Building a lookup table that supplies the filter portion of the query
(this could get unmaintainable)

Including a pre-calculated score in the document (unfortunately, this
doesn't work on a per-query basis, as the score may change based on the
user's needs)

Extending the DefaultSimilarity class (I'm not sure how this helps me in
this scenario, though)
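
As a rough illustration of the lookup-table option, here is a minimal Python
sketch. It assumes a hypothetical BOOST_RULES table that maps a query term to
(field, value, boost) triples; the table entries, the "brand_simple" field and
the build_query helper are made up for illustration and are not part of the
gist. The generated body uses the same function_score / boost_factor syntax as
the query above.

```python
import json

# Hypothetical rule table: query term -> list of (field, value, boost).
# The entries are illustrative only; "brand_simple" is an assumed field.
BOOST_RULES = {
    "ipad": [("category_simple", "electronics", 2)],
    "macbook": [("brand_simple", "apple", 2)],
}


def build_query(text, rules=BOOST_RULES):
    """Return a search body for `text`, adding one function per matching rule."""
    # Assumes the *_simple fields are indexed lowercased, as "ipad" suggests.
    term = text.lower()
    base = {"term": {"title_simple": term}}
    functions = [
        {"filter": {"term": {field: value}}, "boost_factor": boost}
        for field, value, boost in rules.get(term, [])
    ]
    if not functions:
        # No business rule for this term: fall back to the plain query.
        return {"query": base}
    return {"query": {"function_score": {"query": base, "functions": functions}}}


if __name__ == "__main__":
    print(json.dumps(build_query("ipad"), indent=2))
```

The query-building code stays fixed while the table grows, but someone still
has to curate the table, which is the maintenance cost mentioned above.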

What have other people done to solve these problems? Is there something
else that I'm missing that could help?

Here's a runnable gist -
https://gist.github.com/dlmitchell/826e8fb7ca89bed30e4a/raw/613be2c202b26faaaa5899bdcfeac714737beb49/sample_mapping.sh

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/70849d62-822a-4bb6-99f4-d9400d091fa9%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Justin_Treher · January 7, 2014, 7:50pm

I think you will find that for small documents, which aren't really
documents at all but collections of data points (a product library, for
example), you won't use the built-in scoring at all. The built-in
scoring works well for books and articles (long works of text). For a
product library, you will use an array of custom boosts through the
function score query. The key is to get all those data points into your
documents so that you can boost on matches.

For example, with "xbox," you could have a keywords field that includes
xbox only for consoles. Or maybe Xbox is in the title of the console while
games just have Xbox listed as their console compatibility; then only
matches in the title score higher.

For the macbook, you could have an accessories flag where items flagged as
an accessory receive a negative boost.
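
A minimal Python sketch of how those two data points might drive a single
generic query, so that no rule is tied to a particular search term. The
"keywords" and "is_accessory" fields, the weights and the helper name are
assumptions for illustration, and the "negative boost" is interpreted here as
a boost_factor below 1 that demotes flagged items.

```python
def build_product_query(text, keyword_boost=3, accessory_penalty=0.2):
    """Build one generic search body; the rules live in document fields."""
    # Assumes the queried fields are indexed lowercased.
    term = text.lower()
    return {
        "query": {
            "function_score": {
                "query": {"term": {"title_simple": term}},
                "functions": [
                    # Promote products whose curated keywords contain the term
                    # ("keywords" is a hypothetical field).
                    {"filter": {"term": {"keywords": term}},
                     "boost_factor": keyword_boost},
                    # Demote anything flagged as an accessory
                    # ("is_accessory" is a hypothetical boolean field).
                    {"filter": {"term": {"is_accessory": True}},
                     "boost_factor": accessory_penalty},
                ],
                # Combine the matching functions by multiplying them.
                "score_mode": "multiply",
            }
        }
    }
```

The same body serves "xbox", "macbook" or any other term; only the indexed
documents change when a product is tagged or flagged.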

For Apple food vs. Apple products, you can use sales data or user history.

The key to relevancy that works for your organization is giving
Elasticsearch all the data points it needs to base its decisions on. For
products, your best solution is a big old set of constant score queries
wrapped in some wild function score queries.
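
One way this could look, sketched in Python under stated assumptions: each
field gets a constant_score clause with a hand-picked weight, and the
combined query is wrapped in a function_score that folds in a sales signal.
The field weights, the "keywords", "compatibility" and "units_sold" fields
and the helper names are hypothetical. field_value_factor is only available
in newer Elasticsearch releases; on older ones a script_score function would
play the same role.

```python
# Hypothetical per-field weights: a title hit counts more than a
# compatibility hit, no matter how often the term occurs in the field.
FIELD_WEIGHTS = {"title_simple": 10, "keywords": 5, "compatibility": 1}


def constant_clauses(term, weights=FIELD_WEIGHTS):
    """One constant_score clause per field, each scoring a fixed boost."""
    return [
        {"constant_score": {"filter": {"term": {field: term}}, "boost": boost}}
        for field, boost in weights.items()
    ]


def build_weighted_query(text):
    """Wrap the constant-score clauses in a function_score with a sales signal."""
    term = text.lower()  # assumes the fields are indexed lowercased
    return {
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        "should": constant_clauses(term),
                        "minimum_should_match": 1,
                    }
                },
                "functions": [
                    # "units_sold" is an assumed numeric field holding sales data.
                    {"field_value_factor": {"field": "units_sold",
                                            "modifier": "log1p"}}
                ],
            }
        }
    }
```

Because every clause contributes a fixed weight, the ranking depends on which
fields matched and on the sales figure rather than on text statistics.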

daviddeath · January 7, 2014, 10:03pm

Thanks for your answer.

So, instead of relying on queries to pull out the right stuff, you're
suggesting to model the documents to the queries.

This suggests that there's a custom boost for every search term, which is
what I was hoping to avoid, if only because of the impossible task of going
through all our data and determining what to boost/not boost. This also
implies that there's another key/value store of queries-to-boost keywords,
which again could get costly to maintain.

If I'm understanding you correctly, it would look similar to what I
previously posted, only with a larger (possibly dynamic) set of boost
queries.

Doing so is primarily a manual task - are there more automatic ways to
build up relevancy, or even tools/processes that help?
