Product/price modeling with nested sort problem

Hi community!

I'm posting here about a data modeling problem we are facing with
elasticsearch. Here is the situation:

Our business model is relatively simple: we have products that each
contain a (potentially large) set of prices. Both products and prices have
about a dozen (indexed) fields each.

The number of products remains reasonable, at only a few thousand, while
the number of prices per product varies widely: from a few thousand to
100K or more for some products. We currently have ~5K products with a
total of ~16M prices, and these figures will grow in the near future.

Our search requests target products, with possible criteria on both the
product and its prices (elasticsearch fits all our filtering /
faceting needs), but with one particularity: we always need products
sorted by the best price matching the provided criteria.

We first tried loading a price index with the product data duplicated for
each price. But since grouping is not implemented in elasticsearch afaik,
we quickly gave up on this model (we only want distinct products in the
result).

The next logical approach was to embed prices into products using a nested
mapping to simulate grouping. With that, we were able to correctly sort
products on their best price matching the criteria, using nested sort and
nested filters. But once the whole dataset was loaded (our 5K products with
embedded prices), we saw high response times on some queries (~700ms),
especially those that do not filter the dataset enough.
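
For readers unfamiliar with the nested approach described above, here is a
minimal sketch of what such a mapping could look like, written as a Python
dict ready to be serialized to JSON. The type name `product` and the field
names (`name`, `prices`, `price`, `currency`) are assumptions made for
illustration; the real mapping is not shown in this thread.

```python
import json

# Hypothetical mapping: each product embeds its prices as nested objects,
# so every price is indexed as a hidden document next to its product.
product_mapping = {
    "product": {
        "properties": {
            "name": {"type": "string"},
            "prices": {
                "type": "nested",
                "properties": {
                    "price": {"type": "integer"},
                    "currency": {"type": "string", "index": "not_analyzed"},
                },
            },
        }
    }
}

mapping_json = json.dumps(product_mapping, indent=2)
print(mapping_json)
```

With 16M prices over 5K products, this layout means a single product can
carry 100K or more nested objects, which is where the sort cost discussed
in this thread comes from.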

We think this is because some price field values are
unevenly distributed across prices (certain values predominate over others)
and across products (some prices are widely spread over products).

So we tried splitting our dataset into specialized product indices: we
created several, in each of which products contain only the prices matching
a particular criterion value (the predominant ones), and we route searches
accordingly. Response time was a lot better (~100ms). We could even combine
several indices when we need to search on multiple field values.

Nonetheless, in the latter case, we noted that the nested sort results are
not merged during the gather phase: some products can appear several times
in a search involving multiple indices (i.e. we lose grouping), probably
because a product can appear in several of these specialized indices with
the same price value.

Is this the expected behavior? If so, is there any workaround?
Do you think there is another approach to consider?

PS: we also considered parent/child, but the distribution problem remains
the same (child prices are routed to the parent product's shard). In
addition, to retrieve the best price id for each product in the results, we
concatenate the price value and the price id in a text field and execute
the nested sort on this field. But parent/child sorting seems to
be valid only on numeric fields.

Nicolas

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi Nicolas,

Can you share your search request for the nested sort case? I just want to
look at how you use this feature (for example, what sort_mode, what type of
nested filter, etc.).

What price ranges are commonly used by end users? Or are the
ranges arbitrarily chosen by the end user? Does the query also take that
long (700ms) without nested sorting?

In the second case, getting duplicate products in the search result is
expected, because your logical product is indexed twice, but with different
prices.
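
One possible client-side workaround, not something proposed in this thread,
is to merge the hits from the specialized indices yourself and keep only
the cheapest entry per product. The sketch below assumes each hit has
already been reduced to a hypothetical (product_id, best_price) pair. It
only deduplicates the hits that were fetched, so deep pagination would need
more care.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, float]  # (product_id, best_price): hypothetical shape


def merge_best_price(hits: Iterable[Hit]) -> List[Hit]:
    """Keep the lowest price seen for each product, sorted by price ascending."""
    best: Dict[str, float] = {}
    for product_id, price in hits:
        if product_id not in best or price < best[product_id]:
            best[product_id] = price
    # Sort by price, then by id so ties come out in a stable order.
    return sorted(best.items(), key=lambda item: (item[1], item[0]))


hits_from_indices = [("p1", 120.0), ("p2", 80.0), ("p1", 95.0), ("p3", 80.0)]
print(merge_best_price(hits_from_indices))
```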

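About the parent/child approach: sorting based on child values (when
using the has_child filter / query) isn't possible at the moment. There is
a workaround that allows you to sort parent documents by the child
documents' fields:

http://stackoverflow.com/questions/14504180/elasticsearch-sorting-parents-through-child-values/14519947#14519947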

In order to get the best price per returned product, you can build a
secondary request using the multi search api. Each search request in the
multi search api gets the top child document for a returned product in the
previous response (filter by the product id and use that as routing value).
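
The second phase could be sketched as follows. This builds the
newline-delimited body that the multi search API expects: one header line
and one search line per product. The index name `products`, the type
`price`, and the fields `productId` and `price` are assumptions made for
illustration; the actual mapping is not shown in the thread.

```python
import json
from typing import Iterable


def build_best_price_msearch(product_ids: Iterable[str]) -> str:
    """Build a multi search body fetching the cheapest price doc per product."""
    lines = []
    for product_id in product_ids:
        # Header: children live on the parent's shard, so route by product id.
        header = {"index": "products", "type": "price", "routing": product_id}
        # Body: restrict to this product's prices and keep only the cheapest.
        body = {
            "query": {
                "filtered": {
                    "query": {"match_all": {}},
                    "filter": {"term": {"productId": product_id}},
                }
            },
            "sort": [{"price": {"order": "asc"}}],
            "size": 1,
        }
        lines.append(json.dumps(header))
        lines.append(json.dumps(body))
    # The multi search body must end with a newline.
    return "\n".join(lines) + "\n"


print(build_best_price_msearch(["p1", "p2"]))
```

The resulting string would be sent as the body of a single `_msearch`
request, so the per-product lookups cost one round trip instead of one per
product.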

Martijn

On 22 April 2013 09:10, Nicolas Colomer inoskyh@gmail.com wrote:

--
Kind regards,

Martijn van Groningen


Hi Martijn,

Thank you very much for your time!

> Can you share your search request for the nested sort case? Just want to
> look at how you use this feature (for example what sort_mode, what type of
> nested filter etc.).

Here is the gist: https://gist.github.com/ncolomer/a58042b363f2efdc5f88
The priceSort field is formatted as follows: priceSort: {price}!{priceId}
{price} is left-padded with # up to 5 chars to keep the ordering correct.
{priceId} is appended so that we can retrieve the product's matching price
in a second phase.
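
As an illustration of this key format, here is a minimal sketch in Python.
It assumes prices are non-negative integers, which the thread does not
state; the function names are hypothetical. The padding works because `#`
sorts before the digits in ASCII, so a shorter number always compares lower
than a longer one.

```python
def make_price_sort(price: int, price_id: str, width: int = 5) -> str:
    """Build '{price}!{priceId}' with the price left-padded with '#'."""
    digits = str(price)
    if price < 0 or len(digits) > width:
        raise ValueError("price does not fit in the padded width")
    return digits.rjust(width, "#") + "!" + price_id


def parse_price_sort(key: str) -> tuple:
    """Split a sort key back into (price, price_id)."""
    padded, price_id = key.split("!", 1)
    return int(padded.lstrip("#")), price_id


keys = [make_price_sort(p, pid) for p, pid in [(1000, "a"), (999, "b"), (12, "c")]]
print(sorted(keys))
```

Sorting the keys as plain strings then yields the same order as sorting by
numeric price, and the price id can be recovered from the winning key.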

> What price ranges are commonly used by end users? Or are the
> ranges arbitrarily chosen by the end user?

The price range can be arbitrarily chosen by end users, but price is only
one possible criterion.
Other common queries filter on one (or more) price fields, which means
adding a term filter, for instance.
Filtering on a price range (query-range file in the gist) gives us good
response times. But filtering on a particular field (query-filter file in
the gist) gives us the bad metrics I mentioned in my previous message.

> Does the query also take that long (700ms) without nested sorting?

Without nested sorting, the time drops to ~160ms.
In addition, we observed the following (frustrating!) metrics:

  • when we change the sort mode from min to max in the nested sort, the
    query is blazing fast (<30ms)
  • when we change the sort order from asc to desc in the nested sort,
    response time is excellent too (<30ms)

In both cases, the total hits remain the same as for the original query.

> In the second case, getting duplicate products in the search result is
> expected, because your logical product is indexed twice, but with different
> prices.

Yep, thanks for confirming that!

> About the parent / child approach. Sorting based on child values (when
> using has_child filter / query) isn't possible at the moment. There is a
> workaround that allows you to sort parent document by the child documents
> fields:
>
> http://stackoverflow.com/questions/14504180/elasticsearch-sorting-parents-through-child-values/14519947#14519947

Correct me if I'm wrong, but Clinton's answer is about nested mapping, not
parent/child mapping?
As I understand it, the proposed solution replaces the nested sort
with a custom score and takes advantage of the score_mode field (the sort
is then naturally based on this score)? Do you expect better performance?
I'll give it a try.

Some extra questions:

  • The query result set differs depending on whether or not I apply a
    filter on the (nested) sort, even though I use a filtered query: is that
    normal? Should I systematically apply my query filters to the sort too?
  • Currently, the statistical facet (see gist) seems to apply to all prices
    of the matching products. Can it be applied only to the price that
    matched (the best one)? I didn't see any "min" aggregation mode in the
    nested facet doc
    http://www.elasticsearch.org/guide/reference/api/search/facets/.
    Maybe using a script field?

Nicolas

2013/4/23 Martijn v Groningen martijn.v.groningen@gmail.com


>> About the parent / child approach. Sorting based on child values (when
>> using has_child filter / query) isn't possible at the moment. There is a
>> workaround that allows you to sort parent document by the child documents
>> fields:
>>
>> http://stackoverflow.com/questions/14504180/elasticsearch-sorting-parents-through-child-values/14519947#14519947
>
> Correct me if I'm wrong but Clinton's answer is about nested mapping, not
> parent/child mapping?
> As I understand, the provided solution implies replacing the nested
> sorting with a custom score and take advantage of the score_mode field
> (sort will be naturally based on this score)? Do you expect better
> performances? I'll give it a try.

I forgot to mention that what he is doing for the nested query can also be
done for the has_child query. Basically, the
custom_score query takes the price and uses it as the score, which allows
you to sort by price.

I think that using the has_child query with the above trick will perform
better than nested sorting (with the 100k inner objects).
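
A rough sketch of that trick, written as a Python dict to be serialized as
the search body. The child type `price` and the numeric field `price` are
hypothetical names, and `score_type` is used as it existed on the has_child
query around the time of this thread; check the documentation of your
version before relying on it. Since hits are ordered by descending score,
this version ranks products by their highest price. Getting the cheapest
first would require the script to invert the value, which is left out here
because it is untested.

```python
import json

# Hypothetical names: child type "price" with a numeric field "price".
has_child_by_price = {
    "query": {
        "has_child": {
            "type": "price",
            # The parent score is derived from the child scores; "max" keeps
            # the highest child score, i.e. the highest price with this script.
            "score_type": "max",
            "query": {
                "custom_score": {
                    "query": {"match_all": {}},
                    # The child's score becomes its price.
                    "script": "doc['price'].value",
                }
            },
        }
    }
}

print(json.dumps(has_child_by_price, indent=2))
```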

> Some extra questions:
>
>   • Query resultset is different when I apply a filter on the (nested) sort
>     or not, while I use a filtered query: is it normal? Should I systematically
>     apply my query filters on the sort?

The filter inside the nested sort doesn't include / exclude docs from the
hits. It controls which inner objects participate in sorting the
root doc. In practice you would almost always repeat the child query /
filter from your nested query / filter in the nested sort filter.
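
To make the two roles concrete, here is a small sketch that builds a search
body where the same price filter is used twice: once to select products,
once to restrict which prices feed the sort. The path `prices` and the
fields `prices.price` and `prices.currency` are assumed names, not taken
from the gist.

```python
import json


def build_search(price_filter: dict) -> dict:
    """Use one nested price filter for both hit selection and sorting."""
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                # Decides which products are returned at all.
                "filter": {"nested": {"path": "prices", "filter": price_filter}},
            }
        },
        "sort": [
            {
                "prices.price": {
                    "order": "asc",
                    "mode": "min",
                    # Decides which prices of a returned product are
                    # considered when computing its sort value.
                    "nested_filter": price_filter,
                }
            }
        ],
    }


body = build_search({"term": {"prices.currency": "EUR"}})
print(json.dumps(body, indent=2))
```

Without the `nested_filter`, a product matching on an EUR price could still
be sorted by a cheaper price in another currency, which would explain the
differing result order Nicolas observed.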

>   • Currently, the statistical facet (see gist) seems to apply on all prices
>     of the matching products. Can it be only applied on the price that matched
>     (the best one). I didn't see any "min" aggregation mode in the nested facet
>     doc http://www.elasticsearch.org/guide/reference/api/search/facets/?
>     Maybe using a script field?

The counts are currently based on all matching nested inner objects, and
there is no real way of changing this behaviour, unless you know the
lowest price, in which case you can add a facet_filter for it. The script
field doesn't help because it has nothing to do with faceting: during the
fetch phase it just adds fields to the hits that will be returned in the
response. I think having the facet counts based on root documents instead
of inner objects would be useful in your case. So, for example, you would
have a terms facet on a field inside a nested inner object and the counts
would represent the number of root documents.
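
The difference between the two counting modes can be illustrated with plain
Python, independent of elasticsearch. The data and the `currency` field are
made up for the example.

```python
from collections import Counter

# Hypothetical data: each product (root doc) embeds prices (inner objects).
products = {
    "p1": [{"currency": "EUR"}, {"currency": "EUR"}, {"currency": "USD"}],
    "p2": [{"currency": "EUR"}],
}

# Counting inner objects: every matching price adds one.
inner_counts = Counter(
    price["currency"] for prices in products.values() for price in prices
)

# Counting root documents: a product adds one per distinct value it holds.
root_counts = Counter(
    currency
    for prices in products.values()
    for currency in {price["currency"] for price in prices}
)

print(dict(inner_counts))
print(dict(root_counts))
```

Here EUR is counted three times over inner objects but only twice over root
documents, since both EUR prices of p1 belong to the same product.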

Martijn
