Custom Score Query and non-numeric field values

I've been playing around with Elasticsearch for a couple of days now, and I
can't wrap my mind around how to do a specific type of query.

I want a custom score query that is a function of not only numeric field
values, but also of categorical field values. The categorical field value
should be mapped to a float based on whether it appears in a given array,
or by using a dictionary-like mapping.

For example, let's say I'm searching an index of docs representing cars,
and I want a custom score that favors cars with field 'color' in ['red',
'yellow'] (non-numeric, categorical) and cars with a low 'price' (numeric).

I had two hunches for how this might work in Elasticsearch:

  1. A 'param' defined by a script, e.g. (in pseudocode)
    'param1': 2 if doc['color'] in ['red', 'yellow'] else 1.5
    ... but this did not seem to work.

  2. MVEL
    ... but on an initial look through the MVEL docs I couldn't find an
    'in array' operator.
Is there a way to do this?

Thanks,
Dan

--

Hi Dan

> I want a custom score query that is a function of not only numeric
> field values, but also of categorical field values. The categorical
> field value should be mapped to a float based on whether it appears in
> a given array, or by using a dictionary-like mapping.

You're approaching this from the wrong angle. Elasticsearch is built to
allow easy relevance scoring based on the types of things that you are
talking about, but to get the best out of ES, you should use the native
functionality.

A custom_score approach is a when-all-else-fails approach, and there are
easier ways to do it:

> For example, let's say I'm searching an index of docs representing
> cars, and I want a custom score that favors cars with field 'color' in
> ['red', 'yellow'] (non-numeric, categorical) and cars with a low
> 'price' (numeric).

You want to apply a "boost" to certain documents to increase their
relevance. The criteria you are considering are all things that are
easily achievable using filters.

The easiest way to use filters to apply boost is with the
custom_filters_score query.

This could give you something like this (based on your criteria
specified above):

curl -XGET 'http://127.0.0.1:9200/products/car/_search?pretty=1' -d '
{
   "query" : {
      "custom_filters_score" : {
         "query" : {
            "match" : {
               "brand" : "porsche"
            }
         },
         "score_mode" : "total",
         "filters" : [
            {
               "boost" : "2",
               "filter" : {
                  "terms" : {
                     "color" : [
                        "red",
                        "yellow"
                     ]
                  }
               }
            },
            {
               "boost" : "2",
               "filter" : {
                  "numeric_range" : {
                     "price" : {
                        "lt" : 1000000
                     }
                  }
               }
            }
         ]
      }
   }
}
'
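
The dictionary-like mapping from the question can probably be approximated
by generating one filter per category value, each with its own boost. Below
is a minimal Python sketch that only builds the request body. The helper name
build_boost_query and the boost values are invented for illustration, and it
assumes 'color' is indexed so that a "term" filter matches the stored value
exactly (e.g. lowercased or not_analyzed).

```python
import json


def build_boost_query(brand, color_boosts, max_price, price_boost):
    """Build a custom_filters_score request body (hypothetical helper).

    color_boosts is a dict mapping a color value to the boost that
    documents with that color should receive.
    """
    # One term filter per category value, each carrying its own boost.
    filters = [
        {"boost": boost, "filter": {"term": {"color": color}}}
        for color, boost in sorted(color_boosts.items())
    ]
    # The numeric criterion stays a range filter, as in the curl example.
    filters.append(
        {"boost": price_boost,
         "filter": {"numeric_range": {"price": {"lt": max_price}}}}
    )
    return {
        "query": {
            "custom_filters_score": {
                "query": {"match": {"brand": brand}},
                "score_mode": "total",
                "filters": filters,
            }
        }
    }


# Illustrative values only: red is favored over yellow.
body = build_boost_query("porsche", {"red": 2.0, "yellow": 1.5}, 1000000, 2.0)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent to the same _search endpoint with curl -d, in
place of the hand-written body above.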

clint

--

Thanks, Clint. This makes a lot of sense.

On Friday, November 23, 2012 3:33:05 PM UTC-5, Daniel Weitzenfeld wrote:

[original post quoted in full; see the top of the thread]

--

Ok, I'm a bit confused again, this time about how to think about queries
vs. filters.

Would your example return non-porsches, if they were red and < 1000000? My
hunch is no, but then, isn't the 'query' operating more as a 'filter'?

Aren't the 'filters' in this example operating more like 'queries,' in that
they aren't determining whether a document is in the result set (what I
understand a 'filter' to do), but instead are affecting the score for a
document (what I understand a 'query' to do)?

If this characterization is correct - that you're using 'filters' as
'queries' - do I understand correctly that your reason for doing so is
performance-related? I.e., are filters cached in ways that queries are not?

On Friday, November 23, 2012 4:06:12 PM UTC-5, Clinton Gormley wrote:

[Clint's reply quoted in full; see above]

--

Hi Daniel

> Ok, I'm a bit confused again, this time about how to think about
> queries vs. filters.
>
> Would your example return non-porsches, if they were red and
> < 1000000? My hunch is no, but then, isn't the 'query' operating more
> as a 'filter'?

The query itself determines what is returned. So in the example I gave,
the query will only return cars whose brand is "porsche".

The custom_filters_score query simply uses filters to manipulate the
score. So: if the document matches this filter, then change the score
by X.

This isn't the same as using filters in, e.g., a "filtered" query, where the
meaning is: exclude this document unless it matches this filter.
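
To make the difference concrete, here is a toy Python model (not
Elasticsearch code) of the two behaviours. It assumes that with score_mode
"total" the boosts of all matching filters are summed and the sum is
multiplied into the query score, and that a document matching no filter keeps
its score unchanged. The function names and sample documents are invented for
illustration.

```python
def custom_filters_score(hits, boosted_filters):
    """Toy model: every hit is kept, matching filters only scale the score.

    hits is a list of (doc, query_score) pairs already matched by the query;
    boosted_filters is a list of (predicate, boost) pairs.
    Assumes score_mode "total": boosts of all matching filters are summed.
    """
    rescored = []
    for doc, score in hits:
        matched = [boost for matches, boost in boosted_filters if matches(doc)]
        factor = sum(matched) if matched else 1.0  # no match: score unchanged
        rescored.append((doc, score * factor))
    return rescored


def filtered(hits, predicate):
    """Toy model of a "filtered" query: non-matching hits are dropped."""
    return [(doc, score) for doc, score in hits if predicate(doc)]


def is_red_or_yellow(doc):
    return doc["color"] in ("red", "yellow")


def is_cheap(doc):
    return doc["price"] < 1000000


# Invented sample data: all three docs already matched brand == "porsche".
hits = [
    ({"color": "red", "price": 90000}, 1.0),
    ({"color": "black", "price": 90000}, 1.0),
    ({"color": "black", "price": 2000000}, 1.0),
]

boosted = custom_filters_score(hits, [(is_red_or_yellow, 2.0), (is_cheap, 2.0)])
kept = filtered(hits, is_red_or_yellow)

print([score for _, score in boosted])  # [4.0, 2.0, 1.0]
print(len(kept))                        # 1
```

In the model, custom_filters_score returns all three porsches and only
reorders them, while the filtered variant returns just the red one.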

> If this characterization is correct - that you're using 'filters' as
> 'queries' - do I understand correctly that your reason for doing so is
> performance-related? I.e., are filters cached in ways that queries are
> not?

Re "filters as queries": if you mean that I'm using them for scoring as
opposed to traditional include/exclude filtering, then yes. And yes,
because they are cached, they are very efficient.

clint

--