Best approach for modeling multiple prices per product. Would nested docs work?

Geert_Jan_Brits · December 12, 2012, 4:39pm

I feel some background may be needed:

I've got a search engine that needs to return hotels (i.e. 1 hotel is 1
doc). Each hotel has price & availability based on the following combination:
<date,duration, nr of persons, roomtype>. (This means a hotel can have
about 20,000 prices, depending on the chosen combination of date, duration,
nr of persons and roomtype in the current implementation.)

Requirements:

1. Return every hotel only once in the result set.
2. Required in query: each query always requests a price for a
   particular <date,duration, nr of persons, roomtype>-combo. This cannot be
   omitted.
3. Optional in query: filter on hotel-specific info (user rating,
   facilities).
4. Optional in query: filter on 'price' (min/max). Here 'price' is the
   price related to the required <date,duration, nr of persons, roomtype>-combo.
5. Optional in query: sort on price (asc + desc) and/or sort on
   hotel fields like rating.
6. Output: hotelid + price.
7. Even better would be if I could return a 'payload' that is attached to
   the specific <date,duration, nr of persons, roomtype>-combo besides the
   price.

Currently this is implemented in such a way that each <date,duration, nr
of persons, roomtype>-combo has its own dynamic field. However, docs
with 20,000+ fields just aren't something Lucene etc. is really
well suited for. (I can go into specifics, but that's not really the point
here. Blowing up the Lucene fieldcache while sorting on these fields, with
uncontrollable memory consumption as a result, is one of them.)

I'm currently investigating some other approaches:

My best bet currently is on modeling prices by (mis)using the fairly
recent spatial additions to Lucene, i.e. modeling a <<date,duration, nr of
persons, roomtype>,price> combo as a point where point.x = <date,duration,
nr of persons, roomtype> and point.y = price. Relevant discussion (although
in a Solr context):
http://lucene.472066.n3.nabble.com/modeling-prices-based-on-daterange-using-multipoints-td4026011.html
With some work on custom scorers everything should work except requirement 7
(returning a payload related to <date,duration, nr of persons, roomtype>).
Or at least that's my current understanding; please correct me if payloads
can be returned per point using the ES spatial stuff.
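
To make the point-based idea concrete, here is a minimal pure-Python sketch. It is not Lucene or Solr spatial code; the `encode_combo` helper, its packing layout, and the sample hotels are assumptions made up for illustration. The combo is packed into one integer used as x and the price is used as y, so a price filter for one combo becomes a rectangle with x fixed and y between min and max.

```python
from datetime import date

EPOCH = date(2012, 1, 1)  # hypothetical reference date for the encoding


def encode_combo(start: date, duration: int, persons: int, roomtype: int) -> int:
    """Pack a combo into one integer (hypothetical layout, small value ranges assumed)."""
    day = (start - EPOCH).days
    if not (0 <= day < 4096 and 0 <= duration < 64
            and 0 <= persons < 16 and 0 <= roomtype < 16):
        raise ValueError("combo outside the ranges this sketch supports")
    return ((day * 64 + duration) * 16 + persons) * 16 + roomtype


def hotels_in_box(points_by_hotel, x, min_price, max_price):
    """Return {hotel_id: price} for hotels with a point at this x and y in range."""
    hits = {}
    for hotel_id, points in points_by_hotel.items():
        for px, price in points:
            if px == x and min_price <= price <= max_price:
                hits[hotel_id] = price  # each hotel is reported once
                break
    return hits


# Made-up sample data: two hotels, prices stored as (x, y) points.
x = encode_combo(date(2013, 7, 1), 7, 2, 1)
other_x = encode_combo(date(2013, 7, 8), 7, 2, 1)
points_by_hotel = {
    "hotel-1": [(x, 540.0), (other_x, 610.0)],
    "hotel-2": [(x, 980.0)],
}
cheap = hotels_in_box(points_by_hotel, x, 0, 600)
print(cheap)
```

Note that nothing besides the price travels with the point, which is the limitation mentioned for requirement 7.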

Reading up on ES however, perhaps using the nested type would work?
(http://www.elasticsearch.org/guide/reference/mapping/nested-type.html)
I envision 1 doc representing 1 hotel with multiple nested docs of format:
{
  key: someTransform(<date,duration, nr of persons, roomtype>),
  price: <price for this combo>,
  payload: "some other stuff besides price to return related to the
            matched <date,duration, nr of persons, roomtype>"
}
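
For illustration, that layout could be sketched as the Python dicts one would send as JSON. The field names (`hotel_id`, `rating`, `prices`), the `combo_key` helper standing in for someTransform, and the sample values are all assumptions. The mapping syntax follows recent Elasticsearch versions, which may differ from the release that was current when this was posted.

```python
import json


def combo_key(date, duration, persons, roomtype):
    """Hypothetical stand-in for someTransform: one exact-match string per combo."""
    return f"{date}|{duration}|{persons}|{roomtype}"


# Mapping sketch: `prices` is a nested field, so key, price and payload of one
# entry are matched together instead of being flattened across all entries.
mapping = {
    "properties": {
        "hotel_id": {"type": "keyword"},
        "rating": {"type": "float"},
        "prices": {
            "type": "nested",
            "properties": {
                "key": {"type": "keyword"},
                "price": {"type": "float"},
                "payload": {"type": "keyword"},
            },
        },
    }
}

# One hotel document carrying one nested entry per combo (made-up sample data).
hotel_doc = {
    "hotel_id": "hotel-1",
    "rating": 4.5,
    "prices": [
        {"key": combo_key("2013-07-01", 7, 2, "double"), "price": 540.0,
         "payload": "breakfast included"},
        {"key": combo_key("2013-07-08", 7, 2, "double"), "price": 610.0,
         "payload": "room only"},
    ],
}

print(json.dumps(hotel_doc, indent=2))
```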

Then using the 'has_parent' query
(http://www.elasticsearch.org/guide/reference/query-dsl/has-parent-query.html)
I can (please correct if wrong)

1. fetch the child docs for which 'key' matches the
   user-supplied <date,duration, nr of persons, roomtype>
2. sort on price (of the matching child doc)
3. filter on price (of the matching child doc)
4. filter on some fields in the parent doc (the hotel) such as hotel-rating
5. return the matching child doc

However, I also want to be able to:
A. return the hotelid (or in other words the id of the parent-document)
B. sort on hotel-fields like hotel-rating.

The question:

1. Are A. and B. supported in the proposed solution above?
2. Would the idea outlined above work, and are there any caveats I should be aware of?
3. Is there some other (better) way to model this?

Thanks,
Geert-Jan

--

radu_gheorghe · December 15, 2012, 3:14pm

Hello Geert-Jan,

There's a difference between parent and child documents[0] and nested
documents[1]. You can even have both if you need to, although it might not
help your use case.

Both of them are under the hood separate documents, but they're different
from the searching application's POV. The nested document is treated as a
single doc (which makes it easier to search), while parent and child
documents are treated separately (which makes it easier to index/update).
I'm not sure which one's best for you.

I'll try to address your points inline.

On Wed, Dec 12, 2012 at 6:39 PM, Geert-Jan Brits gbrits@gmail.com wrote:

Then using the 'has_parent' query
(http://www.elasticsearch.org/guide/reference/query-dsl/has-parent-query.html)
I can (please correct if wrong)

fetch the childdocs for which 'key' matches the
user-supplied: <date,duration, nr of persons, roomtype>

sort on price (of the matching childdoc)

filter on price (of the matching childdoc)

filter on some fields in the parent-doc (the hotel) such as hotel-rating

return the matching child-doc

Right. You can do that with either parent-child or nested docs.

However, I also want to be able to:
A. return the hotelid (or in other words the id of the parent-document)

In nested documents you get your whole doc, with all sub-documents.
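
As a hedged sketch of the nested variant, the function below builds a search body as a plain Python dict. It assumes a nested field named `prices` with `key` and `price` sub-fields and a top-level `rating` field, none of which come from this thread. It also relies on `inner_hits` and the `nested` sort option of recent Elasticsearch versions, which return or sort by only the matching nested entry; whether these are available depends on the version in use.

```python
def build_nested_search(key, min_price=None, max_price=None,
                        min_rating=None, sort_on="price", order="asc"):
    """Build a search body for hotels having a nested price entry for `key`."""
    combo_filter = {"term": {"prices.key": key}}
    nested_filters = [combo_filter]

    # Optional min/max filter on the price of the requested combo.
    bounds = {}
    if min_price is not None:
        bounds["gte"] = min_price
    if max_price is not None:
        bounds["lte"] = max_price
    if bounds:
        nested_filters.append({"range": {"prices.price": bounds}})

    filters = [{
        "nested": {
            "path": "prices",
            "query": {"bool": {"filter": nested_filters}},
            # Ask for the matching nested entry (price + payload) only.
            "inner_hits": {"size": 1},
        }
    }]

    # Optional filter on a hotel-level field.
    if min_rating is not None:
        filters.append({"range": {"rating": {"gte": min_rating}}})

    if sort_on == "price":
        # Sort by the price of the requested combo, not by any other entry.
        sort = [{"prices.price": {"order": order,
                                  "nested": {"path": "prices",
                                             "filter": combo_filter}}}]
    else:
        sort = [{sort_on: {"order": order}}]

    return {"query": {"bool": {"filter": filters}},
            "sort": sort,
            "_source": ["hotel_id"]}


body = build_nested_search("2013-07-01|7|2|double", max_price=600, min_rating=4)
```

Each hit is still one hotel document, so every hotel appears only once in the result set.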

If you search child documents, you have the ID of the parent in the
"_parent" field of each child doc.

B. sort on hotel-fields like hotel-rating.

Again, with nested docs it's easy. With parent-child, you have to run
your search against the hotel (parent) docs to get results sorted by fields
in there. If you want to add pricing (children) criteria, you can add a
has_child query in there.
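
A minimal sketch of such a request body, built as a Python dict: the search targets the hotel (parent) docs, sorts on the hotel's `rating`, and uses `has_child` for the pricing criteria. The child type name `price` and the field names `key`, `price` and `rating` are assumptions, and how the parent/child relation itself is declared in the mapping differs between Elasticsearch versions.

```python
def build_has_child_search(key, min_price=None, max_price=None,
                           min_rating=None, order="desc"):
    """Search parent (hotel) docs that have a matching child (price) doc."""
    child_filters = [{"term": {"key": key}}]

    # Optional min/max filter on the child's price.
    bounds = {}
    if min_price is not None:
        bounds["gte"] = min_price
    if max_price is not None:
        bounds["lte"] = max_price
    if bounds:
        child_filters.append({"range": {"price": bounds}})

    filters = [{
        "has_child": {
            "type": "price",  # hypothetical child type name
            "query": {"bool": {"filter": child_filters}},
        }
    }]

    # Optional filter on a hotel-level field.
    if min_rating is not None:
        filters.append({"range": {"rating": {"gte": min_rating}}})

    return {"query": {"bool": {"filter": filters}},
            "sort": [{"rating": {"order": order}}]}


body = build_has_child_search("2013-07-01|7|2|double", max_price=600)
```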

The question:

are A. and B. supported in the proposed solution above?

I think they are, yes.

would the idea outlined above work, any caveats I should be aware of?

has_child filters and queries are run first, and then the "parent"
query/filter. IDs are loaded into memory, and since you say you have lots
of children in general, I would assume such queries will be
memory-intensive.

On the other hand, updating a hotel doc with 20,000 nested price docs
(which implies reindexing the whole doc) would also be resource-intensive.

some other (better) way to model this?

You might want to do at least some of the combining of rooms/persons/room
type at query time using script fields.

But that would make your queries slower, of course.

[0] Elasticsearch Platform — Find real-time answers at scale | Elastic
[1] Elasticsearch Platform — Find real-time answers at scale | Elastic

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

--

Geert_Jan_Brits · December 21, 2012, 11:22am

Radu,

I didn't get alerted about your post (pretty annoying Google Groups stuff if
you ask me).

Thanks for confirming the stuff should work. I'll check your suggestion
about script-fields, really helpful.

Geert-Jan

--

Nicolas_G · June 3, 2013, 9:27am

Radu: Did you end up implementing this? I am curious how it worked, since I
am thinking of doing something similar.

Thanks,
Nicolas

--

Donald_Piret · January 31, 2014, 9:38pm

Would be very interested to find out if any of you guys have managed to
implement this successfully. Struggling with a similar issue at the moment.
